Task 1: Exploratory data analysis
library(readxl)
library(Seurat)
library(tidyverse)
library(chameleon)
library(randomForest)
Preprocess metadata
Load scRNA-seq data object and metadata.
galen_aml <- readRDS("reanalyze-aml2019/Seurat_AML.rds")
donor_metadata <-
readxl::read_xlsx(
"ScienceDirect_files_21Nov2024_08-09-30.987/1-s2.0-S0092867419300947-mmc1.xlsx"
)
Load HSPC cell types annotated by backspin cluster and add to
metadata of Seurat object. From aml2019
github repo.
backspin_celltypes <-
read_tsv("aml2019/04 Random forest classifier/BM_6915cells.BackSPIN.txt")
backspin_celltypes <-
column_to_rownames(backspin_celltypes, var = "cell") %>% dplyr::rename(backspin_celltype = cluster)
rownames(backspin_celltypes) <- gsub("-", "\\.", rownames(backspin_celltypes))
galen_aml <- AddMetaData(galen_aml, metadata = backspin_celltypes)
head(galen_aml@meta.data)
head(donor_metadata)
Metadata included in Seurat object - orig.ident (patient ID + day
from diagnosis) - NumberOfReads - CyclingScore - CyclingBinary -
MutTranscripts (number of transcript for wt allele) - WtTranscripts
(number of transcript for mutated allele) - PredictionRefined
(classification of cells into malignant, healthy and uncertain) -
CellType (classification into 21 healthy and malignant cell types) -
nCount_RNA - nFeature_RNA
Remove cells and metadata associated to cell lines.
galen_aml$SampleClass <-
ifelse(
grepl("BM", galen_aml$orig.ident),
"HealthyDonor",
ifelse(
grepl("OCI.AML3|MUTZ3", galen_aml$orig.ident),
"CellLine",
"AmlDonor"
)
)
galen_aml_donors <- subset(galen_aml, SampleClass != "CellLine")
donor_metadata <- donor_metadata %>%
filter(!Sample %in% c("OCI-AML3", "MUTZ3"))
Correct value in blast count metadata. “1% (76% promono-cytes)” is
considered as blast count of 76% in the paper (Fig. 2b).
donor_metadata$`Blast count` <-
as.numeric(gsub("<5\\%", "0.05", gsub(
"<1\\%",
"0.01",
gsub(".*76\\%.*", "0.76", donor_metadata$`Blast count`)
)))
Warning: NAs introduced by coercion
donor_metadata$Age <- as.numeric(donor_metadata$Age)
Merge donor metadata into Seurat object metadata.
# match sample ID between Seurat object and metadata table
donor_metadata <- donor_metadata %>%
mutate(SampleId = gsub("CD", "", gsub(" ", "\\.", gsub(
"\\+", "p", gsub("-", "n", Sample)
)))) %>%
mutate(SampleId = ifelse(
grepl("BM", SampleId),
SampleId,
paste(SampleId, `Days from diagnosis`, sep = ".")
))
# merge into seurat object metadata, preserve order of cells
tmp <-
merge(
rownames_to_column(galen_aml_donors@meta.data),
donor_metadata,
by.x = "orig.ident",
by.y = "SampleId",
all.x = T
)
tmp <- column_to_rownames(tmp, "rowname")
galen_aml_donors@meta.data <- tmp[colnames(galen_aml_donors),]
Sex distribution
Tettero, J.M., Cloos, J. & Bullinger, L. Acute myeloid leukemia:
does sex matter?. Leukemia 38, 2329–2331 (2024). https://doi.org/10.1038/s41375-024-02435-z
AML is more frequent in males (ca. 4.5 per 100,000) vs females (ca.
3.0 per 100,000). A number of studies showing differences in treatment
response in males and females demonstrates the importance to consider
sex.
The van Galen dataset only has male healthy donors and is biased
towards male AML donors (62.5%). Slightly more pronounced when
considering AML samples (66.7%).
Only cells from healthy donors and genotyped malignant cells from AML
cells were used for training the malignant vs. normal classifier. I.e.,
all healthy cells in the training data were male.
This would usually result in a classification of all female cells as AML
cells. However, there is also a male AML samples with loss of Y
(AML707B), which makes sex more ambiguously linked to malignant status.
Furthermore, expression of sex-specific genes could be low/sparse, and
not all cells might be distinguishable by sex based on their
expression.
# donors
p1 <- donor_metadata %>%
select(Sample, Gender) %>%
mutate(Sample = gsub(" .*", "", Sample)) %>%
unique() %>%
mutate(Group = ifelse(grepl("BM", Sample), "Healthy", "AML")) %>%
group_by(Group, Gender) %>%
summarize(n_donors = n()) %>%
ggplot(aes(fill = Gender, y = n_donors, x = Group)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "None",
panel.grid = element_blank()
)
`summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# samples
p2 <- donor_metadata %>%
select(Sample, `Days from diagnosis`, Gender) %>%
unique() %>%
mutate(Group = ifelse(grepl("BM", Sample), "Healthy", "AML")) %>%
group_by(Group, Gender) %>%
summarize(n_samples = n()) %>%
ggplot(aes(fill = Gender, y = n_samples, x = Group)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(panel.grid = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1))
`summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
ggpubr::ggarrange(p1, p2, align = "h", widths = c(1, 1.3))

Cancer driver mutations
Plot which sample has which cancer driver gene mutated (see Fig.
2B).
AML722B BCOR in paper Fig. 2b but not in supplementary metadata.
All samples have a unique combination of mutations.
sample_mutation_matrix <- donor_metadata %>%
filter(
!`RHP Mutations` %in% c("Not performed", "None Detected", "Unknown", "NA") &
!is.na(`RHP Mutations`)
) %>%
mutate(driver_mutations = `RHP Mutations`) %>%
separate_longer_delim(cols = driver_mutations, "/// ") %>%
mutate(driver_mutations = gsub(" .*", "", driver_mutations)) %>%
# unique for samples with multiple mutations in same gene
unique() %>%
# count number of samples for each gene and sort by that
group_by(driver_mutations) %>%
mutate(n_samples = n()) %>%
arrange(desc(n_samples)) %>%
mutate(value = 1)
sample_mutation_matrix <- sample_mutation_matrix %>%
mutate(
driver_mutations = factor(
driver_mutations,
levels = unique(sample_mutation_matrix$driver_mutations),
ordered = T
),
Sample = factor(
Sample,
level = unique(sample_mutation_matrix$Sample),
ordered = T
)
)
# mutations
p1 <- sample_mutation_matrix %>%
ggplot(aes(x = driver_mutations, y = Sample)) +
geom_tile(fill = "darkred") +
theme_bw() +
theme(axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1
)) +
theme(panel.grid = element_blank())
# rearrangements
p2 <- sample_mutation_matrix %>%
ggplot(aes(x = `Common translocation`, y = Sample)) +
geom_tile(fill = "darkred") +
theme_bw() +
theme(axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1
)) +
theme(
panel.grid = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
) +
ylab("")
ggpubr::ggarrange(
p1,
p2,
ncol = 2,
nrow = 1,
align = "h",
widths = c(4, 1)
)

Cell type composition
Cell types present in data.
unique(galen_aml_donors$CellType)
[1] HSC Prog earlyEry lateEry GMP ProMono Mono cDC pDC ProB Plasma T CTL B
[15] NK HSC-like Prog-like GMP-like ProMono-like Mono-like cDC-like
Levels: HSC Prog earlyEry lateEry GMP ProMono Mono cDC pDC ProB B Plasma T CTL NK HSC-like Prog-like GMP-like ProMono-like Mono-like cDC-like
New metadata column whether cell is a cancer celltype or healthy.
Color scheme to avoid rainbow colors.
cell_types <- unique(galen_aml_donors@meta.data$CellType)
cell_type_colors <-
sample(chameleon::distinct_colors(n = length(cell_types))$name,
size = length(cell_types))
cell_type_colors <- setNames(cell_type_colors, cell_types)
Create more “coarse” cell type labels. Lymphoid and erythroid
(non-malignant only), and Myeloid/HSC, Myeloid/HSC-like.
lymphoid_celltypes <- c("Plasma",
"NK",
"ProB",
"B",
"T",
"CTL",
"pDC")
galen_aml_donors@meta.data <- galen_aml_donors@meta.data %>%
mutate(IsCancerCelltype = ifelse(grepl("-like", CellType), T, F)) %>%
mutate(CellTypeGeneral = ifelse(
CellType %in% lymphoid_celltypes,
"Lymphoid",
ifelse(
IsCancerCelltype,
"Myeloid/HSPC-like",
ifelse(grepl("Ery", CellType), "Erythroid", "Myeloid/HSPC")
)
))
Visualize cell type proportions across samples.
galen_aml_donors@meta.data %>%
group_by(orig.ident) %>%
count(CellType) %>%
ggplot(aes(fill = CellType, y = n, x = orig.ident)) +
geom_bar(position = "fill", stat = "identity") +
scale_fill_manual(values = cell_type_colors[unique(galen_aml_donors@meta.data$CellType)]) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())

galen_aml_donors@meta.data %>%
group_by(orig.ident) %>%
count(IsCancerCelltype) %>%
ggplot(aes(fill = IsCancerCelltype, y = n, x = orig.ident)) +
geom_bar(position = "fill", stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())

galen_aml_donors@meta.data %>%
group_by(orig.ident) %>%
count(PredictionRefined) %>%
ggplot(aes(fill = PredictionRefined, y = n, x = orig.ident)) +
geom_bar(position = "fill", stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())

galen_aml_donors@meta.data %>%
group_by(orig.ident) %>%
count(CellTypeGeneral) %>%
ggplot(aes(fill = CellTypeGeneral, y = n, x = orig.ident)) +
geom_bar(position = "fill", stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())

Myeloid and HSPCs are the cells that are difficult to distinguish
between healthy and malignant in AML cells. This is the actual difficult
prediction task of the classifier. Visualize the proportion of these
cells in the samples.
galen_aml_donors@meta.data %>%
filter(CellTypeGeneral %in% c("Myeloid/HSPC", "Myeloid/HSPC-like")) %>%
group_by(orig.ident, CellTypeGeneral) %>%
summarize(n_cells = n()) %>%
ggplot(aes(x = orig.ident, y = n_cells, fill = CellTypeGeneral)) +
geom_bar(stat = "identity", position = "fill") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())
`summarise()` has grouped output by 'orig.ident'. You can override using the `.groups` argument.

Proportion of cycling cells is not higher in AML samples at diagnosis
than in healthy samples. AML is a disease of differentiation. The
accumulation of blasts in the bone marrow is the problem, not hyper
proliferation, as in other cancers.
galen_aml_donors@meta.data %>%
filter(`Days from diagnosis` == "D0" |
SampleClass == "HealthyDonor") %>%
group_by(orig.ident, SampleClass) %>%
count(CyclingBinary) %>%
ggplot(aes(fill = CyclingBinary, y = n, x = orig.ident)) +
geom_bar(position = "fill", stat = "identity") +
facet_grid( ~ SampleClass, scales = "free_x", space = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), panel.grid = element_blank())

Genotype information
Classify cells based on their genotype information into malignant and
healthy cells.
I classify all cells with a mutant transcript detected as malignant
cells.
All other cells from AML samples with a wt allele detected could be
classified as healthy.
However, this is imperfect, especially for heterozygous dominant
negative mutations.
Furthermore, due to clonality, there might be cells with wt allele
detected and mut allele missed.
Both can lead to false positive healthy cells.
Therefore, only using cells from AML donors genotyped as mutant and
healthy cells from healthy donors in the classifier (as done by van
Galen).
“The second classifier is used for determining if a cell for which we
did not detect a mutant transcript is malignant or normal, based on its
similarity to normal and malignant cells (i.e., cells from healthy BM
and HSC to myeloid-like cells from tumor samples for which we detected
mutant transcripts).”
galen_aml_donors@meta.data <- galen_aml_donors@meta.data %>%
mutate(genotype = ifelse(
!is.na(MutTranscripts) &
MutTranscripts != "",
"malignant",
ifelse(
!is.na(WtTranscripts) &
WtTranscripts != "",
"healthy",
"not_detected"
)
))
Plot 1: Cells with mut detected, only wt detected or no coverage,
compared to expected blast count.
Plot 2: Cells usable to trian malignant vs. healthy classifier.
There are multiple samples and patients with no or almost no
malignant genotyped cells that can be used for training.
Therefore, the classifier needs to be generalizable to other samples and
mutations.
Overfitting to specific patients would be bad.
p1 <- galen_aml_donors@meta.data %>%
group_by(orig.ident, genotype) %>%
summarise(n_cells = n()) %>%
ungroup() %>%
ggplot(aes(fill = genotype, y = n_cells, x = orig.ident)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(axis.text.x = element_blank(),
panel.grid = element_blank()) +
scale_x_discrete(drop = FALSE) +
xlab(element_blank())
`summarise()` has grouped output by 'orig.ident'. You can override using the `.groups` argument.
p2 <- galen_aml_donors@meta.data %>%
select(`Blast count`, orig.ident) %>%
unique() %>%
ggplot(aes(y = `Blast count`, x = orig.ident)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank()) +
scale_x_discrete(drop = FALSE) +
xlab(element_blank())
ggpubr::ggarrange(
p1,
p2,
align = "v",
heights = c(2, 1),
ncol = 1,
nrow = 2
)
Warning: Removed 6 rows containing missing values or values outside the scale range (`geom_bar()`).

p1 <- galen_aml_donors@meta.data %>%
mutate(Sample = as.factor(Sample)) %>%
filter(genotype == "malignant" |
SampleClass == "HealthyDonor") %>%
group_by(Sample, genotype) %>%
summarise(n_cells = n()) %>%
ungroup() %>%
ggplot(aes(fill = genotype, y = n_cells, x = Sample)) +
geom_bar(stat = "identity") +
theme_bw() +
scale_x_discrete(drop = FALSE) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank()) + scale_x_discrete(drop = FALSE)
`summarise()` has grouped output by 'Sample'. You can override using the `.groups` argument.Scale for x is already present.
Adding another scale for x, which will replace the existing scale.
p2 <- galen_aml_donors@meta.data %>%
filter(genotype == "malignant" |
SampleClass == "HealthyDonor") %>%
group_by(orig.ident, genotype) %>%
summarise(n_cells = n()) %>%
ungroup() %>%
ggplot(aes(fill = genotype, y = n_cells, x = orig.ident)) +
geom_bar(stat = "identity") +
theme_bw() +
scale_x_discrete(drop = FALSE) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank()) + scale_x_discrete(drop = FALSE)
`summarise()` has grouped output by 'orig.ident'. You can override using the `.groups` argument.Scale for x is already present.
Adding another scale for x, which will replace the existing scale.
ggpubr::ggarrange(
p1,
p2,
align = "v",
heights = c(1, 1),
ncol = 1,
nrow = 2
)

Concordance of genotype and van Galen celltype annotation in AML
samples. NB: Cells with only wt transcript detected (“healthy”) in AML
samples might not be actually healthy (heterozygosity, clonality).
galen_aml_donors_metadata <- galen_aml_donors@meta.data
table(
filter(galen_aml_donors_metadata, SampleClass != "HealthyDonor")$genotype,
filter(galen_aml_donors_metadata, SampleClass != "HealthyDonor")$CellType
)
HSC Prog earlyEry lateEry GMP ProMono Mono cDC pDC ProB B Plasma T CTL NK HSC-like Prog-like GMP-like ProMono-like Mono-like cDC-like
healthy 25 98 82 189 48 101 164 85 12 6 24 129 179 106 198 47 218 118 195 180 98
malignant 5 11 15 6 9 8 2 9 4 1 5 2 5 5 10 39 188 114 292 129 80
not_detected 409 444 403 872 329 413 2025 474 118 98 378 946 6201 966 1604 1850 3290 1591 929 2241 1890
Sex bias in cells that can be used to train classifier (mut
transcript detected from AML donor, or from healthy donor).
p1 <- galen_aml_donors@meta.data %>%
filter(genotype == "malignant" |
SampleClass == "HealthyDonor") %>%
mutate(class = ifelse(genotype == "malignant", "malignant", "healthy")) %>%
group_by(Gender, class) %>%
summarise(n_cells = n()) %>%
ungroup() %>%
ggplot(aes(x = class, y = n_cells, fill = Gender)) +
geom_bar(position = "fill", stat = "identity") +
# geom_bar(stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())
`summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
p2 <- galen_aml_donors@meta.data %>%
filter(genotype == "malignant" |
SampleClass == "HealthyDonor") %>%
mutate(class = ifelse(genotype == "malignant", "malignant", "healthy")) %>%
group_by(Gender, class) %>%
summarise(n_cells = n()) %>%
ungroup() %>%
ggplot(aes(x = Gender, y = n_cells, fill = class)) +
geom_bar(position = "fill", stat = "identity") +
# geom_bar(stat = "identity") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())
`summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
p1 | p2

Accuracy of the classifier
Check the accuracy of the van Galen classifier based on concordance
with clinical blast count. This has already been checked in the paper
and the answer is that the correlation is high.
# samples with unclear prediction
galen_aml_donors@meta.data %>%
filter(PredictionRefined == "unclear") %>%
pull(orig.ident) %>%
unique() %>% as.character()
[1] "AML314.D0" "AML314.D31" "AML371.D0" "AML371.D34" "AML722B.D0" "AML722B.D49" "AML997.D0" "AML997.D35"
pred_malignant_vs_blast_count <- galen_aml_donors@meta.data %>%
group_by(
orig.ident,
PredictionRefined,
`Blast count`,
`Cell number`,
Sample,
Gender,
# DriverMutations,
`Days from diagnosis`
) %>%
summarize(PredictionRefinedProportion = n() / `Cell number`) %>%
filter(PredictionRefined == "malignant") %>%
select(
`Blast count`,
PredictionRefinedProportion,
orig.ident,
Sample,
Gender,
# DriverMutations,
`Days from diagnosis`
) %>%
unique() %>%
mutate(`Blast count` = as.numeric(`Blast count`)) %>%
ungroup()
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
Please use `reframe()` instead.
When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped data frame and adjust accordingly.`summarise()` has grouped output by 'orig.ident', 'PredictionRefined', 'Blast count', 'Cell number', 'Sample', 'Gender', 'Days from diagnosis'. You can override using the `.groups` argument.Adding missing grouping variables: `PredictionRefined`, `Cell number`
nrow(pred_malignant_vs_blast_count)
[1] 26
pred_malignant_vs_blast_count %>%
ggplot(aes(
x = PredictionRefinedProportion,
y = `Blast count`,
group = Sample,
color = Sample
)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
ggrepel::geom_text_repel(aes(label = orig.ident),
size = 3,
max.overlaps = Inf) +
theme_bw() +
coord_equal() +
theme(panel.grid = element_blank())

cor.test(
pred_malignant_vs_blast_count$PredictionRefinedProportion,
pred_malignant_vs_blast_count$`Blast count`
)
Pearson's product-moment correlation
data: pred_malignant_vs_blast_count$PredictionRefinedProportion and pred_malignant_vs_blast_count$`Blast count`
t = 12.296, df = 24, p-value = 7.529e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8460991 0.9680065
sample estimates:
cor
0.9289867
If the sex bias in healthy and AML samples matters, then the
difference between predicted and observed blast count (error) should be
larger in females than males.
Predicted proportion of malignant cells agrees less with clinical
blast count for female samples. Markedly, female samples have an
overestimated proportion of malignant cell.
However, this could also be due to other factors covarying with Sex,
like mutated gene or blast count.
Median absolute error 10.8%(F), 3.1%(M).
pred_malignant_vs_blast_count_error <- pred_malignant_vs_blast_count %>%
mutate(error = PredictionRefinedProportion - `Blast count`,
absolute_error = abs(error),
relative_error = PredictionRefinedProportion / `Blast count`) %>%
pivot_longer(
cols = c(error, absolute_error, relative_error),
names_to = "error_type",
values_to = "error"
) %>%
group_by(error_type, Gender)
pred_malignant_vs_blast_count_error %>%
summarize(median_error = round(median(error), 3))
`summarise()` has grouped output by 'error_type'. You can override using the `.groups` argument.
symmetric_limits <- function (x)
{
max <- max(abs(x))
c(-max, max)
}
p1 <- pred_malignant_vs_blast_count_error %>%
filter(error_type %in% c("error", "absolute_error")) %>%
ggplot(aes(x = error_type, y = error * 100, fill = Gender)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_boxplot(outlier.size = 0) +
geom_point(pch = 21, position = position_jitterdodge(jitter.width = 0.1)) +
scale_y_continuous(labels = scales::percent_format(scale = 1)) +
theme_bw() +
theme(panel.grid = element_blank(), legend.position = "None") +
ylab("Prediction error % blasts") +
scale_y_continuous(limits = symmetric_limits) +
xlab(element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
p2 <- pred_malignant_vs_blast_count_error %>%
filter(error_type %in% c("relative_error")) %>%
ggplot(aes(x = error_type, y = log2(error), fill = Gender)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_boxplot(outlier.size = 0) +
geom_point(pch = 21, position = position_jitterdodge(jitter.width = 0.1)) +
theme_bw() +
theme(panel.grid = element_blank()) +
ylab("log2 relative error % blasts") +
scale_y_continuous(limits = symmetric_limits) +
xlab(element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggpubr::ggarrange(p1, p2, align = "h", widths = c(1, 1.2))

Data quality
According to manuscript, cells were filtered for >1000 UMI,
>500 genes and <20% mitochondrial+ribosomal transcripts.
The Rdata object contains filtered cells.
VlnPlot(galen_aml_donors, features = c("nCount_RNA", "nFeature_RNA"), ncol = 1, log = T, pt.size = 0)

Low dimensional embedding
AML and healthy donors
Normalize data by library size and log transform.
Identify 2000 highly variable genes.
Scale/z-score highly variable genes.
galen_aml_donors <- NormalizeData(galen_aml_donors, scale.factor = 10000)
galen_aml_donors <- FindVariableFeatures(galen_aml_donors)
galen_aml_donors <- ScaleData(galen_aml_donors)
Calculate PCA and check elbow plot for appropriate number of PCs for
UMAP embedding.
galen_aml_donors <- RunPCA(galen_aml_donors)
ElbowPlot(galen_aml_donors, ndims = 50) +
coord_cartesian(ylim = c(0, NA))

Calculate UMAP embedding.
galen_aml_donors <- RunUMAP(galen_aml_donors, dims = 1:15)
Check UMAP for technical and biological variation.
Higher read counts in erythrocyte progenitors.
# technical
FeaturePlot(galen_aml_donors, features = "nCount_RNA", pt.size = 0.005, max.cutoff = "q99") + coord_equal()

FeaturePlot(galen_aml_donors, features = "nFeature_RNA", pt.size = 0.005) + coord_equal()

# annotation biological
DimPlot(galen_aml_donors, group.by = "SampleClass", shuffle = T, pt.size = 0.005) + coord_equal()

DimPlot(galen_aml_donors, group.by = "IsCancerCelltype", shuffle = T, pt.size = 0.005) + coord_equal()

DimPlot(galen_aml_donors, group.by = "PredictionRefined", shuffle = T, pt.size = 0.005) + coord_equal()

DimPlot(galen_aml_donors, group.by = "CellType", shuffle = T, pt.size = 0.005, label = T, label.size = 3) + coord_equal()

DimPlot(galen_aml_donors, group.by = "orig.ident", shuffle = T, pt.size = 0.005, label = T, label.size = 2) + coord_equal() + NoLegend()

FeaturePlot(galen_aml_donors, features = "Age", pt.size = 0.005) + coord_equal()

DimPlot(subset(galen_aml_donors, SampleClass == "HealthyDonor"), group.by = "orig.ident", shuffle = T, pt.size = 0.005) + coord_equal()

# sex differences
DimPlot(galen_aml_donors, group.by = "Gender", shuffle = T, pt.size = 0.005) + coord_equal()

DimPlot(subset(galen_aml_donors, SampleClass == "AmlDonor"), group.by = "Gender", shuffle = T, pt.size = 0.005) + coord_equal()

DimPlot(subset(galen_aml_donors, SampleClass == "AmlDonor"), group.by = "Sample", shuffle = T, pt.size = 0.005, split.by = "Gender") + coord_equal()

DimPlot(galen_aml_donors, group.by = "Sample", shuffle = T, pt.size = 0.005, split.by = "Gender") + coord_equal()

Embedding healthy donors
Create an embedding of just the healthy samples to visualize the
normal hematopoietic trajectory.
Same procedure as above.
galen_aml_donors_healthy <- subset(galen_aml_donors, SampleClass == "HealthyDonor")
galen_aml_donors_healthy <- FindVariableFeatures(galen_aml_donors_healthy)
galen_aml_donors_healthy <- ScaleData(galen_aml_donors_healthy)
galen_aml_donors_healthy <- RunPCA(galen_aml_donors_healthy)
ElbowPlot(galen_aml_donors_healthy, ndims = 50) +
coord_cartesian(ylim = c(0, NA))

galen_aml_donors_healthy <- RunUMAP(galen_aml_donors_healthy, dims = 1:15)
DimPlot(galen_aml_donors_healthy, group.by = "CellType", shuffle = T, pt.size = 0.01, label = T, label.size = 3) +
coord_equal()

DimPlot(galen_aml_donors_healthy, group.by = "orig.ident", shuffle = T, pt.size = 0.01, label = T, label.size = 3) +
coord_equal() + NoLegend()

AML and healthy donors excluding sex-specific genes
SDE scores for all protein-coding genes in 45 tissues common to men
and women. Genes were analyzed by NOISeqBIO with scores of zero given
for genes with insignificant differential expression. Other genes have
SDE scores below zero for men-biased expression and above zero for
women-biased expression. (CSV 2205 kb)
Bone marrow is not available. Instead use whole blood and spleen.
sex_genes <- read_csv("gershoni_et_al/12915_2017_352_MOESM3_ESM.csv")
Rows: 18759 Columns: 46── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Gene
dbl (45): Adipose-Subcutaneous, Adipose-Visceral, Adrenal_Gland, Artery-Aorta, Artery-Coronary, Artery-Tibial, Bladder, Brain-Amygdala, Brain-Anterior_cingulate_cortex, Brain-Caudate, Brain-...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sex_genes <- sex_genes %>%
select(Gene, Spleen, Whole_Blood) %>%
filter(Spleen != 0 | Whole_Blood != 0) %>%
mutate(gene = gsub("_.*", "", Gene)) %>%
pull(gene)
sex_genes <- c(sex_genes, "XIST")
# make sure IDs match
sex_genes[!sex_genes %in% rownames(galen_aml_donors)]
character(0)
length(sex_genes)
[1] 28
Also exclude genes expressed on Y chromosome. 149 genes present in
scRNA-seq data.
y_chrom_genes <- read_lines("reference_sources/gencode.v44_gene_names_chrY.txt")
y_chrom_genes <- y_chrom_genes[y_chrom_genes %in% rownames(galen_aml_donors)]
length(y_chrom_genes)
[1] 149
sex_genes <- unique(c(sex_genes, y_chrom_genes))
6 out of 2000 highly variable genes (AML+healthy) have sex-specific
expression.
I don’t think removal of all genes that are differentially expressed
betwen M and F should be remove, because there variables that co-vary
with sex.
# sex-specific genes used for embedding
sex_genes[sex_genes %in% VariableFeatures(galen_aml_donors)]
[1] "EIF1AY" "MAP7D2" "VEGFA" "XIST" "SHOX" "ASMT"
Re-calculate PCA, excluding sex-specific genes from HVG list.
galen_aml_donors <-
RunPCA(
galen_aml_donors,
features = VariableFeatures(galen_aml_donors)[!VariableFeatures(galen_aml_donors) %in% sex_genes],
reduction.name = "pca_no_sex"
)
ElbowPlot(galen_aml_donors, ndims = 50, reduction = "pca_no_sex") +
coord_cartesian(ylim = c(0, NA))

galen_aml_donors <-
RunUMAP(
galen_aml_donors,
dims = 1:15,
reduction = "pca_no_sex",
reduction.name = "umap_no_sex"
)
I would check if the integration of sexes has improved if
CalcAlignmentMetric was still available in Seurat v5…
MixingMetric has replaced CalcAlignmentMetric (Seurat v2).
Explanation of MixingMetric
- Look at the k.max nearest neighbors for a cell across all
datasets.
- Compute the k closest neighbors for each dataset individually for
the same cell.
- Find the rank in the overall neighborhood list (from 1.) that
corresponds to the kth neighbor in each dataset (max of k.max) and take
the median across all datasets.
- Compute 1-3 for every cell and average. This average is the value
that is returned by the MixingMetric function. However, it’s usually
more intuitive to think of high values of “mixing” to be better so for
visualization, we often plot max.k - MixingMetric.
Deafult max.k = 300.
According to this metric, mixing of cells from male and female AML
donors has slightly improved.
mixing_aml_sexes <- MixingMetric(
subset(galen_aml_donors, SampleClass == "AmlDonor"),
reduction = "pca",
dims = 1:15,
grouping.var = "Gender"
)
mixing_aml_sexes_corrected <- MixingMetric(
subset(galen_aml_donors, SampleClass == "AmlDonor"),
reduction = "pca_no_sex",
dims = 1:15,
grouping.var = "Gender"
)
300 - mean(mixing_aml_sexes)
[1] 253.2416
300 - mean(mixing_aml_sexes_corrected)
[1] 254.2387
# sex differences
DimPlot(
subset(galen_aml_donors, SampleClass == "AmlDonor"),
reduction = "umap_no_sex",
group.by = "Gender",
shuffle = T,
pt.size = 0.005
) + coord_equal()

Task 2: ML classifier malignant cells
Task Description
An interesting feature of the dataset is the availability of labels
distinguishing malignant and non-malignant single cells. The second task
is to exploit these labels to build a suitable ML classifier that
predicts the normal or malignant identity of single cells. Please detail
on how your classifier works, whether it is interpretable, its
performance, and the train-test split you use for your machine learning
experiment.
Considerations for the task
The tricky part is not training a classifier, but to consider biology
and technology to train a good classifier.
I assume that “labels distinguishing malignant and non-malignant
single cells” in the task description refers to the genotyped cells, and
not the labels predicted by van Galen et al. based on their random
forest classifiers.
Van Galen et al. train two random forest classifiers (see below), in
order to classify cells from AML donors into malignant and normal
cells.
The first classifier is for HSPC cell types and the second for HSPC cell
types + malignant vs. normal. Each classifier consists of an “outer”
classifier for feature selection and an “inner” classifier that is used
for prediction. Due to very limited number of genotyped cells that can
be used for training, they do not hold out a test set. Instead, to
evaluate their model performance, they perform 5-fold cross validation
of the “inner” classifier.
Finally, they use the model trained on all available data for
classification of AML cells that were not genotyped.
I assume that at the time there was no independent dataset available
with genotyped AML cells to test for generalizability of the model to
other scRNA-seq technologies (droplet-based, 5’ sequencing), human
populations, or driver mutations.
Points of critique
- Including all training data in the feature selection makes the CV
results of the “inner” classifier less informative.
- The CV is supposed to evaluate over- and underfitting. To estimate
overfitting (complexity of the model), the model should be evaluated on
how well it extrapolates to another donor and to another driver
mutation, not a random sample of all training cells.
- All healthy cells in the training data are male. Van Galen do not
address this by removing sex-specific genes from the features. (I’m
honestly suprised that the classifier doesn’t perform even worse on
female samples based on blast count than it does. A male sample with
loss of y might play a role, and expression level/sparsity of
sex-specific genes.)
- Cells were genotyped based on transcripts of following genes:
TET2, SETD2, PTPN11, NRAS, SF3A1, BCOR, KIT, RAD21, TP53, BRCC3, RUNX1,
FLT3, IDH2, DNMT3A, SMC3, NPM1, KRAS.
Cells with higher expression of these genes are more likely to be
genotyped and make it into the training data. Their expression should be
excluded from the feature list, also to make it more generalisable to
other driver mutations.
- The classifier does not consider possible differences in
transcription between non-malignant HSPCs from AML donors and healthy
donors. Non-malignant cells from AML donors have possibly altered
transcription due to changes in the environment in the bone
marrow.
The classifier only considers healthy cells from healthy donors, and not
non-malignant cells from AML donors. This is a limitation of the
technology, because cells with wt transcript detected but no mutated
transcript detected cannot confidently be classified as normal, due to
clonality and heterozygosity.
However, there might be a subset of cells from donors with known
zygosity and clonal structure where this is possible.
What I agree with, is the observation that, generally, malignant HSCs
are transcriptionally more similar to healthy HSCs than to malignant
monocyte progenitors. Therefore, it makes sense to guide the model with
information on similarity towards healthy HSPC cell types.
Random forest classifiers allow the extraction of feature importance
metrics (e.g. number of trees in which a feature is used). However, one
has to differ between a minimal optimal subset and an all
relevant subset of genes. This is especially relevant for
gene-expression based classifiers due to redundance (think gene
modules/signatures). See Kursa
et al. 2014 for more details.
Van Galen et al. select as features the genes chosen most often in
the outer classifier, across 1000 trees. This does not seem to consider
classes.
Ideas
- Instead of predicting HSC and HSC-like cells, why not give a
similarity score towards each healthy cell type as input feature. This
would allow to inform the classifier for relevant features like
combinations of signatures (e.g. LSCs arising from GMPs with mixed
HSC/GMP signature).
- Why include lymphoid and erythroid cell types in the malignant
vs. non-malignant classifier? They are transcriptionally very distinct
and in AML they are not malignant. The classifier might have an easier
time if they are excluded.
- Consider other methods to assign the most similar healthy cell type,
such as label transfer via projection onto the healthy reference
(Azimuth).
- Use an independent test set of genotypes AML cells.
Van Galen classifier malignant vs. normal AML
cells
Transcriptional heterogeneity of malignant and healthy cells
The goal is to develop a classifier of malignant vs. healthy HSPCs
and myeloid cells from AML patients, that is generalizable to other AML
types with other driver mutations and other scRNA-seq
technologies/datasets.
Evaluate the transcriptional heterogeneity of AML subtypes and
malignant vs. healthy cells.
Q1: Do healthy cells differ from malignant cells based on their
transcriptome?
Q2: Is the malignant transcriptional profile specific to driver
mutations?
We are only interested in myeloid and HSPC cells here, because
distinction of erythroid and lymphoid cells is relatively easy. Those
cells are not malignant in acute myeloid leukemia.
For a start, use HSPCs + myeloid cells, according to van Galen cell
type annotation.
Include healthy donors, to see if the cells genotyped as wt group
with definitive healthy cells.
selected_samples <- donor_metadata %>%
# filter(
# !grepl("BM", Sample),
# # `Blast count` > 0.1,
# # `Days from diagnosis` == "D0"
# ) %>%
pull(SampleId)
galen_aml_donors_myeloid_hspc <-
subset(
galen_aml_donors,
subset = (orig.ident %in% selected_samples & SampleClass == "HealthyDonor" |
((CellTypeGeneral == "Myeloid/HSPC-like" |
CellTypeGeneral == "Myeloid/HSPC") & genotype != "not_detected"))
)
table(galen_aml_donors_myeloid_hspc$genotype)
healthy malignant not_detected
1734 886 6558
# only keep SampleId-genotype combinations with at least 10 cells
samples_enough_cells <- names(which(table(galen_aml_donors_myeloid_hspc$orig.ident) >= 10))
galen_aml_donors_myeloid_hspc <-
subset(
galen_aml_donors_myeloid_hspc,
subset = orig.ident %in% samples_enough_cells
)
galen_aml_donors_myeloid_hspc@meta.data <- galen_aml_donors_myeloid_hspc@meta.data %>%
mutate(
SampleId_genotype =
paste(
galen_aml_donors_myeloid_hspc$orig.ident,
galen_aml_donors_myeloid_hspc$genotype,
sep = "_"
)
) %>%
mutate(SampleId_genotype = ifelse(SampleClass == "HealthyDonor", gsub("not_detected", "healthy", SampleId_genotype), SampleId_genotype))
pb_counts <-
Seurat::AggregateExpression(
galen_aml_donors_myeloid_hspc,
assays = "RNA",
slot = "counts",
# group.by = "SampleId_CellTypeGeneral"
group.by = "SampleId_genotype"
# group.by = "orig.ident"
)$RNA
Names of identity class contain underscores ('_'), replacing with dashes ('-')
group_anno <- colnames(pb_counts)
group_anno <- gsub(".*-", "", group_anno)
pb_dge <- edgeR::DGEList(
counts = pb_counts,
samples = colnames(pb_counts),
group = group_anno
)
Filter out samples with low read count (=few filtered cells).
Keep samples with read counts > 50,000, which is all samples.
summary(pb_dge$samples$lib.size)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3670 47074 195297 923075 618775 15170692
keep.samples <- pb_dge$samples$lib.size > 5e4
table(keep.samples)
keep.samples
FALSE TRUE
10 27
# pb_dge <- pb_dge[, keep.samples]
Filter out lowly expressed genes.
Keep genes with at least 10 CPM in at least 2 samples.
# genes with at least 1 CPM in at least 2 samples
keep.genes <- rowSums(edgeR::cpm(pb_dge) >=10) >= 2
table(keep.genes)
keep.genes
FALSE TRUE
15257 12642
pb_dge <- pb_dge[keep.genes, , keep=FALSE]
TMM normalization is performed to estimate effective library
size.
pb_dge <- edgeR::calcNormFactors(pb_dge, method = "TMM")
Calculate Multidimensional scaling (MDS) plot of distances between
gene expression profiles.
pb_mds <- limma::plotMDS(pb_dge, plot = F)
pb_mds_df <- data.frame(
list(
x = pb_mds$x,
y = pb_mds$y,
lib.size = pb_dge$samples$lib.size,
sample = gsub("-.*", "", colnames(pb_dge)),
group = pb_dge$samples$group
)
) %>%
merge(donor_metadata, by.x = "sample", by.y = "SampleId", all.x = T)
Check that major source of variation is not technical (sample size)
or irrelevant biological information in this case (blast count).
xlab_text <- paste0(
"Leading logFC dim 1 (",
round(pb_mds$var.explained[1] * 100, 1),
"%)"
)
ylab_text <- paste0(
"Leading logFC dim 2 (",
round(pb_mds$var.explained[2] * 100, 1),
"%)"
)
mds_theme <- theme_bw() +
theme(panel.grid = element_blank())
pb_mds_df %>%
unique() %>%
ggplot(aes(x, y, label = sample)) +
geom_point(aes(color = log(lib.size))) +
# ggrepel::geom_text_repel() +
xlab(xlab_text) + ylab(ylab_text) +
mds_theme

pb_mds_df %>%
ggplot(aes(x, y, label = sample)) +
geom_point(aes(color = `Blast count`)) +
# ggrepel::geom_text_repel() +
xlab(xlab_text) + ylab(ylab_text) +
mds_theme

pb_mds_df %>%
ggplot(aes(x, y, label = sample)) +
geom_point(aes(color = `Days from diagnosis` == "D0")) +
# ggrepel::geom_text_repel() +
xlab(xlab_text) + ylab(ylab_text) +
mds_theme

pb_mds_df %>%
ggplot(aes(x, y, label = sample)) +
geom_point(aes(color = group)) +
# ggrepel::geom_text_repel() +
xlab(xlab_text) + ylab(ylab_text) +
mds_theme

pb_mds_df %>%
ggplot(aes(x, y, label = sample)) +
geom_point(aes(color = Gender)) +
# ggrepel::geom_text_repel() +
xlab(xlab_text) + ylab(ylab_text) +
mds_theme

pb_mds_df %>%
ggplot(aes(x, y, label = sample)) +
geom_point(aes(color = Sample)) +
ggrepel::geom_text_repel(
size = 2,
min.segment.length = 0,
max.overlaps = Inf
) +
xlab(xlab_text) + ylab(ylab_text) +
mds_theme +
theme(legend.position = "None")

Check if the samples split by mutation.
sampleId_mutation_list <- donor_metadata %>%
select(SampleId, `RHP Mutations`, `Common translocation`) %>%
filter(SampleId %in% pb_mds_df$sample) %>%
pivot_longer(cols = c(`RHP Mutations`, `Common translocation`), values_to = "driver_mutation") %>%
select(-name) %>%
filter(!driver_mutation %in% c("Not performed", "None Detected", "Unknown", "NA") & !is.na(driver_mutation)) %>%
separate_longer_delim(cols = driver_mutation, "/// ") %>%
mutate(driver_mutation = gsub(" .*", "", driver_mutation)) %>%
# unique for samples with multiple mutations in same gene
unique() %>%
mutate(value = T) %>%
pivot_wider(id_cols = SampleId, names_from = driver_mutation, values_fill = F) %>%
pivot_longer(cols = -SampleId, names_to = "driver_mutation")
pb_mds_df_driver_mutation <- merge(pb_mds_df, sampleId_mutation_list, by.x = "sample", by.y = "SampleId", all.x = T)
PC1 aligns with DNMTA3. Indication that DNMT3 has a distinct
transcriptional profile.
“DNMT3A mutations occur in approximately 20% of AML cases and are
associated with changes in DNA methylation. CDKN2B plays an important
role in the regulation of hematopoietic progenitor cells and DNMT3A
mutation is associated with CDKN2B promoter methylation.” https://pmc.ncbi.nlm.nih.gov/articles/PMC7106122/
KRAS, FLT3-ITD, NPM1, NRAS
pb_mds_df_driver_mutation %>%
filter(group == "malignant") %>%
ggplot(aes(x, y, label = sample)) +
geom_point(aes(color = value)) +
facet_wrap(~driver_mutation) +
xlab(xlab_text) + ylab(ylab_text) +
theme_bw() +
theme(panel.grid = element_blank()) +
coord_equal()

unique(sampleId_mutation_list$driver_mutation)
[1] "KRAS" "NRAS" "NOTCH2" "SF3A1" "CBFB-MYH11" "DNMT3A" "NPM1" "TET2" "FLT3-ITD"
[10] "CEBPA" "FLT3" "JAK3" "TP53" "RUNX1" "SETD2" "BCOR" "BCORL1" "NOTCH1"
[19] "SMC3" "ATM" "BRCC3" "KIT" "RAD21" "RUNX1-RUNX1T1"
genes_genotypes <-
c(
"NRAS",
"KRAS",
"SF3A1",
"NOTCH1",
"NOTCH2",
"SF3A1",
"CBFB",
"MYH11",
"NPM1",
"JAK3",
"DNMT3A",
"FLT3",
"TP53",
"SETD2",
"RUNX1",
"BCOR",
"BCORL1",
"PTPN11",
"SMC3",
"IDH2",
"TET2",
"BRCC3",
"KIT",
"RAD21",
"MLL",
"CEBPA"
)
Azimuth projection
Alternative to cell type identification with a random forest model is
label transfer from a reference atlas.
Perform label transfer from the Azimuth bone marrow reference atlas.
Instead of training a random forest classifier to determine the most
similar healthy equivalent of a cell.
Azimuth provides cell type labels at 2 levels, see interactive
reference.
For this step, we are not interested in the genes distinguishing the
different cell types because they are already well defined. So we don’t
need a RF classifier where we can extract feature importance
metrics.
The label transfer is very fast and easy and provides a well
established reference framework of cell type classifications and
hierarchies.
Reference downloaded from here.
library(Azimuth)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
Registered S3 method overwritten by 'SeuratDisk':
method from
as.sparse.H5Group Seurat
Attaching shinyBS
library(SeuratData)
# The RunAzimuth function can take a Seurat object as input
# overwrites miscellanous data, so saving as new object
galen_aml_donors_azimuth <- RunAzimuth(galen_aml_donors,
reference = "azimuth_reference/bonemarrowref.SeuratData/inst/azimuth/")
Add Azimuth labels to original seurat object.
galen_aml_donors <-
AddMetaData(
galen_aml_donors,
galen_aml_donors_azimuth@meta.data,
col.name = c("predicted.celltype.l1", "predicted.celltype.l2")
)
The function returns cell type labels on 2 levels, and 2 QC metrics,
a mapping score and a prediction score for each level.
It also returns the projection of the cells onto the reference UMAP
space.
DimPlot(galen_aml_donors_azimuth, group.by = "predicted.celltype.l1", shuffle = T, label = T) + coord_equal()
FeaturePlot(galen_aml_donors_azimuth, features = "predicted.celltype.l1.score") + coord_equal()
DimPlot(galen_aml_donors_azimuth, group.by = "predicted.celltype.l2", shuffle = T, label = T) + coord_equal()
FeaturePlot(galen_aml_donors_azimuth, features = "predicted.celltype.l2.score") + coord_equal()
FeaturePlot(galen_aml_donors_azimuth, features = "mapping.score") + coord_equal()
DimPlot(galen_aml_donors_azimuth, group.by = "SampleClass", shuffle = T) + coord_equal()
DimPlot(galen_aml_donors_azimuth, group.by = "CellType", shuffle = T, label = T) + coord_equal()
Compare Azimuth annotations with backspin cluster annotation,
followed by RF. Also compare with malignant/non malignant RF
annotation.
heatmap(table(galen_aml_donors$backspin_celltype, galen_aml_donors$predicted.celltype.l1), scale = NULL, Rowv = NA, Colv = NA)
heatmap(table(galen_aml_donors$backspin_celltype, galen_aml_donors$predicted.celltype.l2), scale = NULL, Rowv = NA, Colv = NA)
heatmap(table(galen_aml_donors$CellType, galen_aml_donors$predicted.celltype.l1), scale = NULL, Rowv = NA, Colv = NA)
heatmap(table(galen_aml_donors$CellType, galen_aml_donors$predicted.celltype.l2), scale = NULL, Rowv = NA, Colv = NA)
In the end, I realized that the Azimuth annotations are problematic,
becaue either very granular or useless because lumping everything into
HSPC.
Train classifier
backspin celltype annotation is available for 783 cells of 1,590 BM5
CD34+CD38–.
y <- factor(galen_aml_donors_healthy$backspin_celltype)
X <- galen_aml_donors_healthy@assays$RNA@layers$data
X <- X[,!is.na(y)]
rownames(X) <- rownames(galen_aml_donors_healthy)
y <- y[!is.na(y)]
# # combine HSC and Prog
# y[y %in% c("HSC", "Prog")] <- "HSC_Prog_"
Van Galen et al. select genes with average normalized expression
>= 0.01 (normalized to 10k reads). This is biased by class. Instead,
select genes expressed in 5% cells in at least 1 class (10141
genes).
# mean normalized expression >= 0.01
keep.genes <- rowMeans(exp(X) - 1) >= 0.01
table(keep.genes)
keep.genes
FALSE TRUE
13825 14074
# expressed in 1% cells
keep.genes.expressed <- rowSums((X > 0) / ncol(galen_aml_donors_healthy)) >= 0.01
table(keep.genes.expressed)
keep.genes.expressed
FALSE TRUE
16740 11159
# expressed in 1% cells in at least 1 class
classes <- unique(y)
keep.genes.class <- lapply(classes, FUN = function(class) {
genes <- names(which(rowSums((X[,y == class] > 0) / sum(y == class)) >= 0.05))
return(genes)
})
keep.genes.class <- rownames(X) %in% unlist(keep.genes.class)
table(keep.genes.class)
keep.genes.class
FALSE TRUE
17758 10141
X <- X[which(keep.genes.class),]
All cells are male. Remove sex-specific genes anyways.
keep.genes <- !rownames(X) %in% sex_genes
table(keep.genes)
keep.genes
FALSE TRUE
38 10103
X <- X[which(keep.genes),]
sampsize is the number of cells sampled from each class. By default
ceiling(.632*nrow(x)). This is one way to deal with
imbalanced data in RF by using balanced data sets for each tree, even if
the data is not balanced, which is made possible by bootstrap.
Train 100 trees because I’m impatient.
set.seed(123)
rf.celltype.outer <- randomForest(
x = t(X),
y = y,
sampsize = rep(50, length(unique(y))),
ntree = 200,
do.trace = F
)
rf.celltype.outer
Call:
randomForest(x = t(X), y = y, ntree = 200, sampsize = rep(50, length(unique(y))), do.trace = F)
Type of random forest: classification
Number of trees: 200
No. of variables tried at each split: 100
OOB estimate of error rate: 16.51%
Confusion matrix:
B cDC CTL earlyEry GMP HSC lateEry Mono NK pDC Plasma ProB Prog ProMono T class.error
B 109 0 1 0 0 0 0 2 0 0 0 0 0 0 1 0.03539823
cDC 2 240 2 1 2 1 0 16 2 6 0 0 0 9 0 0.14590747
CTL 1 0 199 1 0 0 0 0 12 0 0 1 0 0 47 0.23754789
earlyEry 0 0 1 407 6 29 94 0 0 0 1 11 67 1 2 0.34248788
GMP 1 4 2 2 455 6 0 1 0 0 1 2 38 19 1 0.14473684
HSC 2 0 4 3 1 1143 0 0 0 0 0 9 107 0 1 0.10000000
lateEry 0 0 0 2 0 1 257 0 0 0 0 0 1 1 0 0.01908397
Mono 1 6 1 0 0 0 2 540 0 0 0 0 0 17 0 0.04761905
NK 0 0 6 0 0 0 0 1 147 0 0 0 1 0 2 0.06369427
pDC 1 0 0 0 0 0 0 0 0 82 0 1 7 0 0 0.09890110
Plasma 0 0 0 0 0 0 0 0 0 1 68 0 0 0 0 0.01449275
ProB 2 0 0 0 0 2 0 0 0 1 0 184 3 0 1 0.04663212
Prog 0 0 0 15 68 307 0 0 1 1 0 12 751 0 1 0.35034602
ProMono 0 4 2 1 21 0 1 78 0 0 0 0 0 516 1 0.17307692
T 1 1 35 1 0 0 0 1 5 1 0 0 0 0 675 0.06250000
Select 1k features from 14074 genes.
bestvar is the variable used to split the node (0 if node is
terminal).
rf.celltype.outer.used1k <-
names(sort(table(
rownames(rf.celltype.outer$importance)[rf.celltype.outer$forest$bestvar[rf.celltype.outer$forest$bestvar !=
0]]
), decreasing = TRUE)[1:1000])
plot(rf.celltype.outer$importance[rf.celltype.outer.used1k, 1]) # correlates to importance measure

set.seed(123)
rf.celltype.inner <- randomForest(
x = t(X[rf.celltype.outer.used1k,]),
y = y,
sampsize = rep(50, length(unique(y))),
ntree = 200,
do.trace = F
)
rf.celltype.inner
Call:
randomForest(x = t(X[rf.celltype.outer.used1k, ]), y = y, ntree = 200, sampsize = rep(50, length(unique(y))), do.trace = F)
Type of random forest: classification
Number of trees: 200
No. of variables tried at each split: 31
OOB estimate of error rate: 13.49%
Confusion matrix:
B cDC CTL earlyEry GMP HSC lateEry Mono NK pDC Plasma ProB Prog ProMono T class.error
B 106 0 1 0 0 0 0 2 0 0 0 0 0 0 4 0.06194690
cDC 3 249 0 1 2 1 0 3 0 6 0 1 2 13 0 0.11387900
CTL 1 0 220 0 0 0 0 0 12 0 1 0 0 0 27 0.15708812
earlyEry 2 1 2 435 13 18 71 0 0 0 1 11 63 0 2 0.29725363
GMP 1 2 1 3 456 4 0 0 0 3 1 1 32 27 1 0.14285714
HSC 1 0 0 7 2 1142 0 0 0 1 0 16 100 0 1 0.10078740
lateEry 0 0 0 7 0 0 255 0 0 0 0 0 0 0 0 0.02671756
Mono 1 9 0 0 0 0 2 542 0 0 0 0 0 13 0 0.04409171
NK 0 0 2 0 0 0 0 0 154 0 0 0 0 0 1 0.01910828
pDC 0 1 0 0 0 1 0 0 0 86 0 1 2 0 0 0.05494505
Plasma 0 0 0 0 0 0 0 0 0 1 68 0 0 0 0 0.01449275
ProB 0 0 0 0 0 3 0 0 0 1 0 188 0 0 1 0.02590674
Prog 0 0 0 8 47 206 0 0 2 3 0 14 876 0 0 0.24221453
ProMono 1 5 1 3 16 0 0 71 0 0 1 0 0 525 1 0.15865385
T 1 1 33 0 0 0 0 0 4 1 0 0 0 0 680 0.05555556
y_pred_healthy <- predict(rf.celltype.inner, newdata = t(X), type = "prob")
rf.celltype.inner$confusion
B cDC CTL earlyEry GMP HSC lateEry Mono NK pDC Plasma ProB Prog ProMono T class.error
B 106 0 1 0 0 0 0 2 0 0 0 0 0 0 4 0.06194690
cDC 3 249 0 1 2 1 0 3 0 6 0 1 2 13 0 0.11387900
CTL 1 0 220 0 0 0 0 0 12 0 1 0 0 0 27 0.15708812
earlyEry 2 1 2 435 13 18 71 0 0 0 1 11 63 0 2 0.29725363
GMP 1 2 1 3 456 4 0 0 0 3 1 1 32 27 1 0.14285714
HSC 1 0 0 7 2 1142 0 0 0 1 0 16 100 0 1 0.10078740
lateEry 0 0 0 7 0 0 255 0 0 0 0 0 0 0 0 0.02671756
Mono 1 9 0 0 0 0 2 542 0 0 0 0 0 13 0 0.04409171
NK 0 0 2 0 0 0 0 0 154 0 0 0 0 0 1 0.01910828
pDC 0 1 0 0 0 1 0 0 0 86 0 1 2 0 0 0.05494505
Plasma 0 0 0 0 0 0 0 0 0 1 68 0 0 0 0 0.01449275
ProB 0 0 0 0 0 3 0 0 0 1 0 188 0 0 1 0.02590674
Prog 0 0 0 8 47 206 0 0 2 3 0 14 876 0 0 0.24221453
ProMono 1 5 1 3 16 0 0 71 0 0 1 0 0 525 1 0.15865385
T 1 1 33 0 0 0 0 0 4 1 0 0 0 0 680 0.05555556
earlyEry misclassified as lateEry and Prog. Prog misclassified as
HSC.
heatmap(rf.celltype.inner$confusion, Rowv = NA, Colv = NA, scale = NULL)

rf.celltype.inner$confusion %>%
as.data.frame() %>%
rownames_to_column("class") %>%
ggplot(aes(x = class, y = class.error)) +
geom_bar(stat = "identity")

Predict celltype of malignant AML cells.
“In four patients (AML314, AML371, AML722B and AML997), for which we
detected few mutant transcripts and few high quality cells, we could not
confidently assign malignant cells.”
I also exclude AML475 and AML420B because they also only have 7 and 6
cells genotyped as malignant (and upon first classification, those cells
were lymphoid cells).
galen_aml_donors_malignant <- subset(galen_aml_donors, subset = (genotype == "malignant" & SampleClass != "HealthyDonor"))
table(galen_aml_donors_malignant$Sample)
AML1012 AML210A AML328 AML329 AML371 AML419A AML420B AML475 AML556 AML707B AML722B AML916 AML921A AML997
15 45 34 92 6 146 6 7 377 40 1 21 143 6
selected_samples <-
donor_metadata$Sample[!donor_metadata$Sample %in% c("AML314", "AML371", "AML722B", "AML997", "AML475", "AML420B")]
galen_aml_donors_malignant <-
subset(galen_aml_donors_malignant, Sample %in% selected_samples)
X_malignant <- galen_aml_donors_malignant@assays$RNA@layers$data
rownames(X_malignant) <- rownames(galen_aml_donors_malignant)
X_malignant <- X_malignant[rf.celltype.outer.used1k,]
y_pred_malignant <- predict(rf.celltype.inner, newdata = t(X_malignant), type = "prob")
y_pred_max <- colnames(y_pred_malignant)[max.col(y_pred_malignant)]
There are cells classified as lymphocytes, despite detection of
mutated transcripts. Same for early erythrocytes, pDC.
table(y_pred_max)
y_pred_max
B cDC CTL earlyEry GMP HSC lateEry Mono NK pDC Plasma ProB Prog ProMono T
11 85 7 31 112 4 5 155 8 7 2 18 176 285 7
table(y_pred_max, galen_aml_donors_malignant$Sample)
y_pred_max AML1012 AML210A AML328 AML329 AML419A AML556 AML707B AML916 AML921A
B 0 0 0 2 2 0 1 0 6
cDC 1 8 1 5 9 1 0 1 59
CTL 0 0 0 0 0 5 0 1 1
earlyEry 2 5 10 3 7 0 2 2 0
GMP 4 4 1 51 8 3 21 0 20
HSC 0 0 2 0 0 0 1 0 1
lateEry 0 0 0 3 0 1 0 0 1
Mono 1 8 0 0 40 103 0 0 3
NK 0 0 2 0 0 2 1 0 3
pDC 0 0 0 1 1 0 0 0 5
Plasma 0 0 0 0 0 2 0 0 0
ProB 0 1 4 0 7 0 1 5 0
Prog 5 9 13 20 67 1 13 11 37
ProMono 2 10 0 6 4 257 0 0 6
T 0 0 1 1 1 2 0 1 1
t(round(t(
table(y_pred_max, galen_aml_donors_malignant$Sample)
) / as.vector(colSums(
table(y_pred_max, galen_aml_donors_malignant$Sample)
)), 2))
y_pred_max AML1012 AML210A AML328 AML329 AML419A AML556 AML707B AML916 AML921A
B 0.00 0.00 0.00 0.02 0.01 0.00 0.03 0.00 0.04
cDC 0.07 0.18 0.03 0.05 0.06 0.00 0.00 0.05 0.41
CTL 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.05 0.01
earlyEry 0.13 0.11 0.29 0.03 0.05 0.00 0.05 0.10 0.00
GMP 0.27 0.09 0.03 0.55 0.05 0.01 0.52 0.00 0.14
HSC 0.00 0.00 0.06 0.00 0.00 0.00 0.03 0.00 0.01
lateEry 0.00 0.00 0.00 0.03 0.00 0.00 0.00 0.00 0.01
Mono 0.07 0.18 0.00 0.00 0.27 0.27 0.00 0.00 0.02
NK 0.00 0.00 0.06 0.00 0.00 0.01 0.03 0.00 0.02
pDC 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.00 0.03
Plasma 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00
ProB 0.00 0.02 0.12 0.00 0.05 0.00 0.03 0.24 0.00
Prog 0.33 0.20 0.38 0.22 0.46 0.00 0.32 0.52 0.26
ProMono 0.13 0.22 0.00 0.07 0.03 0.68 0.00 0.00 0.04
T 0.00 0.00 0.03 0.01 0.01 0.01 0.00 0.05 0.01
There are some rather unexpected HSPC cell type predictions: early
erythroid cells and pDC cells. While lymphoid cell types are pretty
surely a misclassification, early ery and pDC can actually be mutated in
AML, according to literature.
https://www.mdpi.com/2072-6694/14/14/3375 Acute myeloid
leukemia (AML) with ≥2% plasmacytoid dendritic cells (pDC) has been
recently described as AML with pDC differentiation (pDC-AML)
characterized by pDC expansion with frequent RUNX1 mutations.
AML921A has 8% cells predicted as pDC and RUNX1 NM_001754 c.167T>C
p.L56S (63.5%, VUS)
In combination with the possible signature combinations in AML cells,
I want to keep those predictions.
However, there are too few cells to use them as classes in a
classifier. Instead, I’ll include them as features, alongside genes, and
predict malignant vs. normal classes.
I’ll only train 1 classifier, instead of inner and outer. This way,
validation of correct prediction of malignant cells is more
truthful.
X_healthy <- galen_aml_donors_healthy@assays$RNA@layers$data
X_healthy <- X_healthy[,!is.na(galen_aml_donors_healthy$backspin_celltype)]
rownames(X_healthy) <- rownames(galen_aml_donors_healthy)
# combine celltype prediction with gene expression
X_healthy_celltype <- rbind(X_healthy, t(y_pred_healthy))
X_malignant <- galen_aml_donors_malignant@assays$RNA@layers$data
rownames(X_malignant) <- rownames(galen_aml_donors_malignant)
# combine celltype prediction with gene expression
X_malignant_celltype <- rbind(X_malignant, t(y_pred_malignant))
X_celltype <- cbind(X_healthy_celltype, X_malignant_celltype)
dim(X_celltype)
[1] 27914 7828
X <- cbind(X_healthy, X_malignant)
dim(X)
[1] 27899 7828
y = factor(c(rep("healthy", ncol(X_healthy)), rep("malignant", ncol(X_malignant))))
table(y)
y
healthy malignant
6915 913
Keep cells expressed in >= 1% in at least one of the classes.
keep.genes.expressed <- rowSums((X > 0) / ncol(X)) >= 0.01
table(keep.genes.expressed)
keep.genes.expressed
FALSE TRUE
16569 11330
# expressed in 1% cells in at least 1 class
classes <- unique(y)
keep.genes.class <- lapply(classes, FUN = function(class) {
genes <- names(which(rowSums((X[,y == class] > 0) / sum(y == class)) >= 0.01))
return(genes)
})
keep.genes.class <- rownames(X) %in% unlist(keep.genes.class)
names(keep.genes.class) <- rownames(X)
table(keep.genes.class)
keep.genes.class
FALSE TRUE
15572 12327
X <- X[names(which(keep.genes.class)),]
X_celltype <- X_celltype[c(names(which(keep.genes.class)), colnames(y_pred_malignant)), ]
X_celltype_driver <- X_celltype[!rownames(X_celltype) %in% genes_genotypes, ]
Train 1 classifier with celltype probabilities as features and one
without and one without genes with driver mutations.
set.seed(123)
rf.malignant <- randomForest(
x = t(X),
y = y,
sampsize = rep(200, length(unique(y))),
ntree = 200,
do.trace = F
)
set.seed(123)
rf.malignant.celltype <- randomForest(
x = t(X_celltype),
y = y,
sampsize = rep(200, length(unique(y))),
ntree = 200,
do.trace = F
)
set.seed(123)
rf.malignant.celltype.driver <- randomForest(
x = t(X_celltype_driver),
y = y,
sampsize = rep(200, length(unique(y))),
ntree = 200,
do.trace = F
)
Including cell type probabilities gives a
rf.malignant
Call:
randomForest(x = t(X), y = y, ntree = 200, sampsize = rep(200, length(unique(y))), do.trace = F)
Type of random forest: classification
Number of trees: 200
No. of variables tried at each split: 111
OOB estimate of error rate: 8.23%
Confusion matrix:
healthy malignant class.error
healthy 6385 530 0.07664497
malignant 114 799 0.12486309
rf.malignant.celltype
Call:
randomForest(x = t(X_celltype), y = y, ntree = 200, sampsize = rep(200, length(unique(y))), do.trace = F)
Type of random forest: classification
Number of trees: 200
No. of variables tried at each split: 111
OOB estimate of error rate: 7.78%
Confusion matrix:
healthy malignant class.error
healthy 6419 496 0.07172813
malignant 113 800 0.12376780
rf.malignant.celltype.driver
Call:
randomForest(x = t(X_celltype_driver), y = y, ntree = 200, sampsize = rep(200, length(unique(y))), do.trace = F)
Type of random forest: classification
Number of trees: 200
No. of variables tried at each split: 110
OOB estimate of error rate: 8.01%
Confusion matrix:
healthy malignant class.error
healthy 6383 532 0.0769342
malignant 95 818 0.1040526
y_malignant_celltype_pred <- predict(rf.malignant.celltype, newdata = t(X_celltype), type = "prob")
y_malignant_celltype_pred_max <- colnames(y_malignant_celltype_pred)[max.col(y_malignant_celltype_pred)]
donor_list <- unique(galen_aml_donors_malignant$Sample)
names(donor_list) <- donor_list
# donor_accuracy_rf.malignant.celltype <- lapply(donor_list, function(donor) sum(y_malignant_celltype_pred_max[donors == donor] == "malignant") / sum(donors == donor))
# donor_accuracy_rf.malignant.celltype
Model evaluation
Cross-validation on AML donors
Pick 4 donors at random (to save time).
Accuracy is how many cells are correctly classified as malignant.
With this cross validation, we could do more paramter testing. Like
the number of features to pick for the inner cell type classifier. Or
how many paramters to sample for each decision in the tree. Or the depth
of the trees.
donors <- c(galen_aml_donors_healthy$Sample, galen_aml_donors_malignant$Sample)
table(donors)
donors
AML1012 AML210A AML328 AML329 AML419A AML556 AML707B AML916 AML921A BM1
15 45 34 92 146 377 40 21 143 108
BM2 BM3 BM4 BM5 CD34+ BM5 CD34+CD38-
188 643 3738 1431 807
donors_cv <- unique(galen_aml_donors_malignant$Sample, galen_aml_donors_healthy$Sample)
set.seed(123)
donors_cv <- sample(donors_cv, size = 4)
donors_cv
[1] "AML419A" "AML329" "AML707B" "AML210A"
cv_results <- lapply(donors_cv, function(donor) {
cat(donor)
cat("\n")
# train and validation split
X_celltype_cv_train <- X_celltype[, donors != donor]
X_celltype_cv_test <- X_celltype[, donors == donor]
y_cv_train <- y[donors != donor]
y_cv_test <- y[donors == donor]
# train on test data
set.seed(123)
rf.malignant.celltype.cv <- randomForest(
x = t(X_celltype_cv_train),
y = y_cv_train,
sampsize = rep(200, length(unique(y_cv_train))),
ntree = 200,
do.trace = FALSE
)
# predict validation data
y_cv_pred <-
predict(
rf.malignant.celltype.cv,
newdata = t(X_celltype_cv_test),
type = "prob"
)
y_cv_pred_max <- colnames(y_cv_pred)[max.col(y_cv_pred)]
# calculate accuracy
accuracy <- sum(y_cv_pred_max == y_cv_test) / length(y_cv_pred_max)
return(accuracy)
})
AML419A
AML329
AML707B
AML210A
names(cv_results) <- donors_cv
Cross validation when leaving out driver mutation genes.
cv_results_driver <- lapply(donors_cv, function(donor) {
cat(donor)
cat("\n")
# train and validation split
X_celltype_cv_train <- X_celltype_driver[, donors != donor]
X_celltype_cv_test <- X_celltype_driver[, donors == donor]
y_cv_train <- y[donors != donor]
y_cv_test <- y[donors == donor]
# train on test data
set.seed(123)
rf.malignant.celltype.cv <- randomForest(
x = t(X_celltype_cv_train),
y = y_cv_train,
sampsize = rep(200, length(unique(y_cv_train))),
ntree = 200,
do.trace = FALSE
)
# predict validation data
y_cv_pred <-
predict(
rf.malignant.celltype.cv,
newdata = t(X_celltype_cv_test),
type = "prob"
)
y_cv_pred_max <- colnames(y_cv_pred)[max.col(y_cv_pred)]
# calculate accuracy
accuracy <- sum(y_cv_pred_max == y_cv_test) / length(y_cv_pred_max)
return(accuracy)
})
AML419A
AML329
AML707B
AML210A
names(cv_results_driver) <- donors_cv
donor_metadata %>%
filter(Sample %in% donors_cv, `Days from diagnosis` == "D0") %>%
select(Sample, `RHP Mutations`, `Common translocation`) %>%
unique()
lapply(cv_results, round, 3)
$AML419A
[1] 0.842
$AML329
[1] 0.337
$AML707B
[1] 0.45
$AML210A
[1] 0.711
lapply(cv_results_driver, round, 3)
$AML419A
[1] 0.87
$AML329
[1] 0.391
$AML707B
[1] 0.225
$AML210A
[1] 0.778
Excluding genes with driver mutations increases patient CV accuracy
slightly in 3/4 cases.
Cross-validation on driver mutations
TODO
To test the generalizability of the model on unseen mutations, cross
validation should be performed with holding out all samples with a
specific driver mutation.
“To exclude the possibility that the high frequency of cells with
detected NPM1 mutations affected the classifier, we generated a separate
classifier that does not consider NPM1 mutant calls. This separate
classifier had equally high specificity (99.8% of normal cells correctly
called normal), and sensitivity (93% of malignant cells correctly called
malignant) in 5-fold cross-validation. It is also consistent with the
original classifier: 97% of cells originally classified as normal were
classified as normal; 91% of cells originally classified as malignant
were classified as malignant. These results indicate that the classifier
is robust to the frequency of NPM1 mutations in the training set.”
Should have put the NPM1 sample in the non-NPM1 trained
classifier.
External test dataset
scRNA-seq from AML patients (NPM1) with sc genotyping.
Naldini, M.M., Casirati, G., Barcella, M. et al. Longitudinal
single-cell profiling of chemotherapy response in acute myeloid
leukemia. Nat Commun 14, 1285 (2023). https://doi.org/10.1038/s41467-023-36969-0
10 patients with NPM1mut AML, present in van Galen dataset. Majority
of NPM1mut AML sampels in van Galen dataset also have DNMT3 mutation.
DNMT3 status is not mentioned for this study.
3 patients with del(7) AML, not present in van Galen data.
Preprocessing information Feature-barcodes filtered matrices
from Cell Ranger were used as input for Seurat R package64,65 (version
3.2.3). Seurat objects were merged in a single full dataset. Cells with
a mitochondrial count ratio higher than 0.2 and <100 or >7000
expressed genes were removed from the dataset. UMI counts were log
normalized and scaled for a factor of 10,000.
The top 20% most variable genes were selected for downstream analysis.
Cell cycle scores were assigned with the CellCycleScoring function using
a reference gene lists66. We scaled data and regressed out unwanted
variability by passing UMI count, percent of mitochondrial genes and
cell cycle difference defined variables to the vars.to.regress argument.
Cell cycle difference was defined as the difference between S phase and
G/2 M phase module scores. Downstream analysis was performed on the top
100 principal components (PCA). In order to reduce patient-related and
10x chemistry version (v2 vs v3) bias, we performed data integration
using the Harmony package (v1.0)24.
Mutation annotation information NPM1-MF considers the UMI
counts supporting either NPM1 mutant or WT allele to classify cells as
MUT (≥1 UMI for the mutant allele) WT (>5 WT transcripts, no mutant
transcripts) ND – not detected (cells with ≤ 5 WT transcripts and no
mutant transcripts) NoCall (cells without coverage over the NPM1
mutation region)
Check patient metadata.
Do not provide information on age, Sex or clinincal blast count.
metadata_npm1_patients <- readxl::read_xlsx("naldini_et_al/41467_2023_36969_MOESM5_ESM.xlsx")
metadata_npm1_patients
Check cell metadata.
I can’t figure out what the orig.ident (e.g. M03) is. It is the ID used
in the expression matrix. So maybe the capture, but it doesn’t make much
sense to have 9 cells from PT12 and 570 cells from PT13 in M44. Why
would there be leakage between captures.
PT01, PT02, PT13 are primary refractory. PT06, PT07, PT12 are
long-term complete remission. PT08, PT09, PT10, PT15 are NPM1mut AML
relapse post-chemotherapy.
metadata_npm1_cells <- readRDS("naldini_et_al/GSE185991_Full_Patient_metadata.rds")
head(metadata_npm1_cells)
Check number of genotyped cells per sample to pick one for
testing.
PT02, combining diagnosis and D30 samples, has a lot of WT and MUT
cells, NPM1mut, primary refractory. Good patient to test if classifier
can extrapolate to other scRNA-seq datasets.
metadata_npm1_patients %>%
mutate(Sample = paste(PatientID, Timepoint, sep = "_")) %>%
group_by(Sample, Classification) %>%
summarize(n_cells = sum(n)) %>%
ggplot(aes(x = Sample, y = n_cells, fill = Classification)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))
`summarise()` has grouped output by 'Sample'. You can override using the `.groups` argument.

Captures with PT02 data. Not sure why there are only 9 cells in the
other capture.
metadata_npm1_cells %>%
filter(PatientID == "PT02") %>%
pull(orig.ident) %>%
table()
.
M25 M26
1563 2152
metadata_npm1_cells %>%
filter(PatientID == "PT07") %>%
pull(orig.ident) %>%
table()
.
M21 M22 M23 M24
520 3706 603 6389
Read in count data.
naldini_seurat_m25 <- Read10X(
data.dir = "naldini_et_al/GSE185991_RAW/M25/",
gene.column = 2,
cell.column = 1,
unique.features = TRUE,
strip.suffix = FALSE
)
colnames(naldini_seurat_m25) <- paste0("M25_", gsub("-1", "", colnames(naldini_seurat_m25)))
naldini_seurat_m26 <- Read10X(
data.dir = "naldini_et_al/GSE185991_RAW/M26/",
gene.column = 2,
cell.column = 1,
unique.features = TRUE,
strip.suffix = FALSE
)
colnames(naldini_seurat_m26) <- paste0("M26_", gsub("-1", "", colnames(naldini_seurat_m26)))
# combine captures and merge
naldini_seurat_pt02 <- cbind(naldini_seurat_m25, naldini_seurat_m26)
naldini_seurat_pt02 <- CreateSeuratObject(counts = naldini_seurat_pt02, min.cells = 0, min.features = 200)
ncol(naldini_seurat_pt02)
[1] 4677
naldini_seurat_m25 <- NULL
naldini_seurat_m26 <- NULL
gc()
used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 14875814 794.5 22323122 1192.2 NA 22323122 1192.2
Vcells 5053453562 38554.8 9354963862 71372.8 102400 9327473519 71163.0
# subset to cells present in metadata
naldini_seurat_pt02 <- subset(naldini_seurat_pt02, cells = rownames(metadata_npm1_cells))
ncol(naldini_seurat_pt02)
[1] 3714
# add metadata to Seurat object
naldini_seurat_pt02 <- AddMetaData(naldini_seurat_pt02, metadata_npm1_cells)
table(naldini_seurat_pt02$orig.ident)
M25 M26
1563 2151
naldini_seurat_m21 <- Read10X(
data.dir = "naldini_et_al/GSE185991_RAW/M21/",
gene.column = 2,
cell.column = 1,
unique.features = TRUE,
strip.suffix = FALSE
)
colnames(naldini_seurat_m21) <- paste0("M21_", gsub("-1", "", colnames(naldini_seurat_m21)))
naldini_seurat_m22 <- Read10X(
data.dir = "naldini_et_al/GSE185991_RAW/M22/",
gene.column = 2,
cell.column = 1,
unique.features = TRUE,
strip.suffix = FALSE
)
colnames(naldini_seurat_m22) <- paste0("M22_", gsub("-1", "", colnames(naldini_seurat_m22)))
naldini_seurat_m23 <- Read10X(
data.dir = "naldini_et_al/GSE185991_RAW/M23/",
gene.column = 2,
cell.column = 1,
unique.features = TRUE,
strip.suffix = FALSE
)
colnames(naldini_seurat_m23) <- paste0("M23_", gsub("-1", "", colnames(naldini_seurat_m23)))
naldini_seurat_m24 <- Read10X(
data.dir = "naldini_et_al/GSE185991_RAW/M24/",
gene.column = 2,
cell.column = 1,
unique.features = TRUE,
strip.suffix = FALSE
)
colnames(naldini_seurat_m24) <- paste0("M24_", gsub("-1", "", colnames(naldini_seurat_m24)))
# combine captures and merge
naldini_seurat_pt07 <- cbind(naldini_seurat_m21, naldini_seurat_m22, naldini_seurat_m23, naldini_seurat_m24)
naldini_seurat_pt07 <- CreateSeuratObject(counts = naldini_seurat_pt07, min.cells = 0, min.features = 200)
ncol(naldini_seurat_pt07)
[1] 13396
naldini_seurat_m21 <- NULL
naldini_seurat_m22 <- NULL
naldini_seurat_m23 <- NULL
naldini_seurat_m24 <- NULL
gc()
used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 14878898 794.7 22323122 1192.2 NA 22323122 1192.2
Vcells 5073623895 38708.7 9354963862 71372.8 102400 9327473519 71163.0
# subset to cells present in metadata
naldini_seurat_pt07 <- subset(naldini_seurat_pt07, cells = rownames(metadata_npm1_cells))
ncol(naldini_seurat_pt07)
[1] 11216
# add metadata to Seurat object
naldini_seurat_pt07 <- AddMetaData(naldini_seurat_pt07, metadata_npm1_cells)
table(naldini_seurat_pt07$orig.ident)
M21 M22 M23 M24
520 3704 603 6389
Filter cells for >3000 UMI, 1000 genes, <10% mitochondrial
transcripts PT02. PT07 seems to have lower quality. Use less stringent
filters.
VlnPlot(naldini_seurat_pt02, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"), log = T, pt.size = 0)
Warning: Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.

VlnPlot(naldini_seurat_pt07, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"), log = T, pt.size = 0)
Warning: Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.

Seems like PT02 is female, PT07 is male.
VlnPlot(naldini_seurat_pt02, features = c("XIST"))
Warning: Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.

VlnPlot(naldini_seurat_pt07, features = c("XIST"))
Warning: Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.Warning: All cells have the same value of XIST.

naldini_seurat_pt02 <-
subset(naldini_seurat_pt02,
subset = nCount_RNA > 3000 & nFeature_RNA > 1000 & percent.mt < 10)
naldini_seurat_pt07 <-
subset(naldini_seurat_pt07,
subset = nCount_RNA > 1000 & nFeature_RNA > 500 & percent.mt < 10)
table(naldini_seurat_pt02$orig.ident)
M25 M26
1347 1916
table(naldini_seurat_pt07$orig.ident)
M21 M22 M23 M24
359 2498 563 5736
naldini_seurat_pt02_pt07 <-
merge(naldini_seurat_pt02, naldini_seurat_pt07)
naldini_seurat_pt02_pt07
An object of class Seurat
33538 features across 12419 samples within 1 assay
Active assay: RNA (33538 features, 0 variable features)
2 layers present: counts.1, counts.2
naldini_seurat_pt02_pt07 <-
merge(
naldini_seurat_pt02,
y = naldini_seurat_pt07,
add.cell.ids = c("PT02", "PT07"),
project = "Naldini"
)
naldini_seurat_pt02_pt07
An object of class Seurat
33538 features across 12419 samples within 1 assay
Active assay: RNA (33538 features, 0 variable features)
2 layers present: counts.1, counts.2
Normalize gene expression.
naldini_seurat_pt02_pt07 <- NormalizeData(naldini_seurat_pt02_pt07)
Normalizing layer: counts.1
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Normalizing layer: counts.2
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Predict HSPC cell type
X <-
cbind(
naldini_seurat_pt02_pt07@assays$RNA@layers$data.1,
naldini_seurat_pt02_pt07@assays$RNA@layers$data.2
)
rownames(X) <- rownames(naldini_seurat_pt02_pt07)
Some features used in classifier are missing in data, either due to
gene ID mismatches or because the genes are not expressed. Both van
Galen and Naldini were aligned to hg38, so it is more likely lack of
expression. Will set missing genes to 0.
features_missing <-
rownames(rf.celltype.inner$importance)[!rownames(rf.celltype.inner$importance) %in% rownames(X)]
features_missing_mat <-
matrix(nrow = length(features_missing),
ncol = ncol(X),
data = 0)
rownames(features_missing_mat) <- features_missing
colnames(features_missing_mat) <- colnames(X)
X <- rbind(X, features_missing_mat)
X <- X[rownames(rf.celltype.inner$importance), ]
naldini_seurat_pt02_pt07_celltype_pred <- predict(rf.celltype.inner, newdata = t(X), type = "prob")
Predict malignant cells
X <-
cbind(
naldini_seurat_pt02_pt07@assays$RNA@layers$data.1,
naldini_seurat_pt02_pt07@assays$RNA@layers$data.2
)
X <- rbind(X, t(naldini_seurat_pt02_pt07_celltype_pred))
rownames(X) <-
c(
rownames(naldini_seurat_pt02_pt07),
colnames(naldini_seurat_pt02_pt07_celltype_pred)
)
features_missing <-
rownames(rf.malignant.celltype$importance)[!rownames(rf.malignant.celltype$importance) %in% rownames(X)]
features_missing_mat <-
matrix(nrow = length(features_missing),
ncol = ncol(X),
data = 0)
rownames(features_missing_mat) <- features_missing
colnames(features_missing_mat) <- colnames(X)
X <- rbind(X, features_missing_mat)
naldini_seurat_pt02_pt07_malignant_pred <-
predict(rf.malignant.celltype,
newdata = t(X),
type = "prob")
Warning: sparse->dense coercion: allocating vector of size 1.1 GiB
naldini_seurat_pt02_pt07_malignant_pred_max <-
colnames(naldini_seurat_pt02_pt07_malignant_pred)[max.col(naldini_seurat_pt02_pt07_malignant_pred)]
Results
Compare malignant vs. healthy prediction with genotype data.
naldini_seurat_pt02_malignant_pred_max <- naldini_seurat_pt02_pt07_malignant_pred_max[naldini_seurat_pt02_pt07$PatientID == "PT02"]
table(
naldini_seurat_pt02_malignant_pred_max,
naldini_seurat_pt02@meta.data$Classification
)
naldini_seurat_pt02_malignant_pred_max MUT ND NoCall WT
healthy 9 410 38 578
malignant 1118 717 190 203
round(t(t(
table(
naldini_seurat_pt02_malignant_pred_max,
naldini_seurat_pt02@meta.data$Classification
)
) / as.vector(
table(naldini_seurat_pt02@meta.data$Classification)
)), 3)
naldini_seurat_pt02_malignant_pred_max MUT ND NoCall WT
healthy 0.008 0.364 0.167 0.740
malignant 0.992 0.636 0.833 0.260
naldini_seurat_pt07_malignant_pred_max <-
naldini_seurat_pt02_pt07_malignant_pred_max[naldini_seurat_pt02_pt07$PatientID == "PT07"]
table(
naldini_seurat_pt07_malignant_pred_max,
naldini_seurat_pt07@meta.data$Classification
)
naldini_seurat_pt07_malignant_pred_max MUT ND NoCall WT
healthy 10 2166 2894 121
malignant 775 922 1816 452
round(t(t(
table(
naldini_seurat_pt07_malignant_pred_max,
naldini_seurat_pt07@meta.data$Classification
)
) / as.vector(
table(naldini_seurat_pt07@meta.data$Classification)
)), 3)
naldini_seurat_pt07_malignant_pred_max MUT ND NoCall WT
healthy 0.013 0.701 0.614 0.211
malignant 0.987 0.299 0.386 0.789
Test on sample with a mutation not present in van Galen data.
del(7) AML: deletion in chromosome 7.
Classification of del(7) cells into malignant/wt:
“We leveraged the AddModuleScore function for evaluating the expression
level of a Chr 7 signature by using as input gene list all genes located
on it. The observed distribution of Chr 7 module scores in the datasets
followed a bimodal distribution allowing us to classify cells as AML or
non-AML by running a k-means clustering (n = 2) on the vector of Chr 7
signature scores and labelling cells in the high score group as non-AML
and those in the low score group as AML.”
Unfortunately, not included in metadata.
PT11, PT17: refractory disease; PT18: early relapse
metadata_del7_cells <- readRDS("naldini_et_al/GSE185991_AML_Del7_metadata.rds")
table(metadata_del7_cells$orig.ident, metadata_del7_cells$PatientID)
PT11 PT17 PT18
M36 6665 0 0
M38 1942 0 0
M84 0 7924 0
M85 0 1978 0
M88 0 0 5847
M89 0 0 3440
M92 0 5512 0
Computing module score with all genes on chr7 might be very slow.
Download cellranger gtf reference, get all genes on chr7.
source="/Users/holzehenrietta/Documents/personal/phd_applications/muds_task_marr/reference_sources"
mkdir -p "$source"
gtf_url="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz"
gtf_in="${source}/gencode.v44.primary_assembly.annotation.gtf"
if [ ! -f "$gtf_in" ]; then
curl -sS "$gtf_url" | zcat > "$gtf_in"
fi
grep "^chr7" $gtf_in | awk '$3 == "gene"' | sed 's/.*gene_name "\([^"]*\)".*/\1/g' > ${source}/gencode.v44_gene_names_chr7.txt
grep "^chrY" $gtf_in | awk '$3 == "gene"' | sed 's/.*gene_name "\([^"]*\)".*/\1/g' > ${source}/gencode.v44_gene_names_chrY.txt
chr7_genes <-
read_lines(
"/Users/holzehenrietta/Documents/personal/phd_applications/muds_task_marr/reference_sources/gencode.v44_gene_names_chr7.txt"
)
Subset to genes expressed in >= 10% of cells in del(7) scRNA-seq
data to speed up analysis.
---
title: "MUDS Marr Exercise"
output: 
  html_notebook: 
    toc: yes
    toc_depth: 4
author: Henrietta Holze
date: "21 November 2024"
---

## Task description

In a 2019 publication (doi:10.1016/j.cell.2019.01.031), Peter van Galen and colleagues analyzed transcriptomic profiles of bone marrow aspirates from 16 AML patients and five healthy donors.

Your first task in this exercise is to provide a concise overview of the dataset, including summary statistics such as the composition of cell-type frequencies across donors and a visualization of the data in two dimensions (e.g., using UMAP or t-SNE embeddings). 

An interesting feature of the dataset is the availability of labels distinguishing malignant and non-malignant single cells. The second task is to exploit these labels to build a suitable ML classifier that predicts the normal or malignant identity of single cells. 
Please detail on how your classifier works, whether it is interpretable, its performance, and the train-test split you use for your machine learning experiment.  


### Relevance of the problem

**Task 1**

It is important to get a good understanding of a dataset before training any kind of machine learning algorithm on it. 
Important factors to be aware of are biases in the data, in particular, biases in the patient metadata.  
Furthermore, the quality of the data has to be assessed, such as sample and cell quality.  
Lastly, transcriptomics data is highly dimensional and in order to recognize patterns in the data, it is useful to embed the data into a 2D space that can be visualized. 
Embedding of data into a lower dimensional space, e.g. via PCA, also serves to remove noise.

**Task 2**

Depending on disease burden, there are healthy and malignant cells present in the bone marrow of patients with AML.  
In order to investigate disease mechanisms of AML, we need to be able to clearly differentiate between healthy and malignant cells. 
Only then, we can look at what biological processes in cell differentiation are disrupted. 
The differentiation is also important to identify treatment targets that do not kill the remaining healthy HSPCs.  
Particularly in MRD samples it is very important to identify the subset of malignant cells that is resistant/tolerant to treatment.  
In AML, healthy and malignant progenitor and myeloid cells are not distinguishable by cell surface markers i.e., it is not possible to isolate healthy/malignant cells by FACS (yet).  
Thus, scRNA-seq data of AML BM samples contain both groups of cells and we have thus far not been able to distinguish them based on their transcriptome.

Van Galen et al. employed a novel technology to genotype a subset of cells present in the scRNA-seq data and used these genotyped cells to build a classifier to distinguish the bulk of the cells into malignant and non-malignant cells.  
However, sc genotyping comes at an extra cost and labor and is not available for all datasets. 
Therefore, it would be extremely useful to have a general classifier of normal vs. malignant cells in AML samples, that is applicable to other scRNA-seq datasets. 

### van Galen 2019 dataset

[Paper](https://www.cell.com/cell/fulltext/S0092-8674(19)30094-7)

[Original analysis of the data](https://github.com/BernsteinLab/aml2019)

[Re-analysis of data](https://github.com/petervangalen/reanalyze-aml2019)

Data downloaded with `wget https://www.dropbox.com/s/399x045zc57fiut/Seurat_AML.rds`

MUTZ-3 and OCI-AML3 are AML cell lines.

BM samples are from healthy donors. 
Cells from one healthy donor were sorted for CD34+ (HSPCs) and CD34+CD38- (HSCs), to enrich for rare early progenitor populations.

All other samples are from patients diagnosed with AML, with different driver mutations. 

AML samples were collected at day 0 and for a subset of samples at different time points post diagnosis (and post chemo).

## Task 1: Exploratory data analysis

```{r results=F}
library(readxl)
library(Seurat)
library(tidyverse)
library(chameleon)
library(randomForest)
```


<!-- ```{r} -->
<!-- # library("reticulate") -->
<!-- # py_install("python-igraph") -->
<!-- # py_install("leidenalg", forge = TRUE) -->
<!-- # install.packages("leiden") -->
<!-- ``` -->

### Preprocess metadata

Load scRNA-seq data object and metadata. 

```{r}
galen_aml <- readRDS("reanalyze-aml2019/Seurat_AML.rds")
donor_metadata <-
  readxl::read_xlsx(
    "ScienceDirect_files_21Nov2024_08-09-30.987/1-s2.0-S0092867419300947-mmc1.xlsx"
  )
```

Load HSPC cell types annotated by backspin cluster and add to metadata of Seurat object.
From [aml2019 github repo](https://github.com/BernsteinLab/aml2019/tree/master).

```{r results=F}
backspin_celltypes <-
  read_tsv("aml2019/04 Random forest classifier/BM_6915cells.BackSPIN.txt")

backspin_celltypes <-
  column_to_rownames(backspin_celltypes, var = "cell") %>% dplyr::rename(backspin_celltype = cluster)
rownames(backspin_celltypes) <- gsub("-", "\\.", rownames(backspin_celltypes))

galen_aml <- AddMetaData(galen_aml, metadata = backspin_celltypes)
```


```{r}
head(galen_aml@meta.data)
head(donor_metadata)
```

Metadata included in Seurat object
- orig.ident (patient ID + day from diagnosis)
- NumberOfReads
- CyclingScore
- CyclingBinary
- MutTranscripts (number of transcript for wt allele)
- WtTranscripts (number of transcript for mutated allele)
- PredictionRefined (classification of cells into malignant, healthy and uncertain)
- CellType (classification into 21 healthy and malignant cell types)
- nCount_RNA
- nFeature_RNA


Remove cells and metadata associated to cell lines.

```{r}
galen_aml$SampleClass <-
  ifelse(
    grepl("BM", galen_aml$orig.ident),
    "HealthyDonor",
    ifelse(
      grepl("OCI.AML3|MUTZ3", galen_aml$orig.ident),
      "CellLine",
      "AmlDonor"
    )
  )
galen_aml_donors <- subset(galen_aml, SampleClass != "CellLine")

donor_metadata <- donor_metadata %>%
  filter(!Sample %in% c("OCI-AML3", "MUTZ3"))
```

Correct value in blast count metadata. 
"1% (76% promono-cytes)" is considered as blast count of 76% in the paper (Fig. 2b).

```{r}
donor_metadata$`Blast count` <-
  as.numeric(gsub("<5\\%", "0.05", gsub(
    "<1\\%",
    "0.01",
    gsub(".*76\\%.*", "0.76", donor_metadata$`Blast count`)
  )))
donor_metadata$Age <- as.numeric(donor_metadata$Age)
```


Merge donor metadata into Seurat object metadata.

```{r}
# match sample ID between Seurat object and metadata table
donor_metadata <- donor_metadata %>%
  mutate(SampleId = gsub("CD", "", gsub(" ", "\\.", gsub(
    "\\+", "p", gsub("-", "n", Sample)
  )))) %>%
  mutate(SampleId = ifelse(
    grepl("BM", SampleId),
    SampleId,
    paste(SampleId, `Days from diagnosis`, sep = ".")
  ))

# merge into seurat object metadata, preserve order of cells
tmp <-
  merge(
    rownames_to_column(galen_aml_donors@meta.data),
    donor_metadata,
    by.x = "orig.ident",
    by.y = "SampleId",
    all.x = T
  )
tmp <- column_to_rownames(tmp, "rowname")
galen_aml_donors@meta.data <- tmp[colnames(galen_aml_donors),]
```


### Sex distribution

Tettero, J.M., Cloos, J. & Bullinger, L. Acute myeloid leukemia: does sex matter?. Leukemia 38, 2329–2331 (2024). [https://doi.org/10.1038/s41375-024-02435-z](https://www.nature.com/articles/s41375-024-02435-z/tables/1)
    
AML is more frequent in males (ca. 4.5 per 100,000) vs females (ca. 3.0 per 100,000).
A number of studies showing differences in treatment response in males and females demonstrates the importance to consider sex.

The van Galen dataset only has male healthy donors and is biased towards male AML donors (62.5%). 
Slightly more pronounced when considering AML samples (66.7%).

Only cells from healthy donors and genotyped malignant cells from AML cells were used for training the malignant vs. normal classifier. 
I.e., all healthy cells in the training data were male.  
This would usually result in a classification of all female cells as AML cells. However, there is also a male AML samples with loss of Y (AML707B), which makes sex more ambiguously linked to malignant status. 
Furthermore, expression of sex-specific genes could be low/sparse, and not all cells might be distinguishable by sex based on their expression.

```{r fig.height=3, fig.width=6}
# donors
p1 <- donor_metadata %>%
  select(Sample, Gender) %>%
  mutate(Sample = gsub(" .*", "", Sample)) %>%
  unique() %>%
  mutate(Group = ifelse(grepl("BM", Sample), "Healthy", "AML")) %>%
  group_by(Group, Gender) %>%
  summarize(n_donors = n()) %>%
  ggplot(aes(fill = Gender, y = n_donors, x = Group)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "None",
    panel.grid = element_blank()
  )

# samples
p2 <- donor_metadata %>%
  select(Sample, `Days from diagnosis`, Gender) %>%
  unique() %>%
  mutate(Group = ifelse(grepl("BM", Sample), "Healthy", "AML")) %>%
  group_by(Group, Gender) %>%
  summarize(n_samples = n()) %>%
  ggplot(aes(fill = Gender, y = n_samples, x = Group)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(panel.grid = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))

ggpubr::ggarrange(p1, p2, align = "h", widths = c(1, 1.3))
```


### Cancer driver mutations

Plot which sample has which cancer driver gene mutated (see Fig. 2B).  
AML722B BCOR in paper Fig. 2b but not in supplementary metadata.  
All samples have a unique combination of mutations. 

```{r}
sample_mutation_matrix <- donor_metadata %>%
  filter(
    !`RHP Mutations` %in% c("Not performed", "None Detected", "Unknown", "NA") &
      !is.na(`RHP Mutations`)
  ) %>%
  mutate(driver_mutations = `RHP Mutations`) %>%
  separate_longer_delim(cols = driver_mutations, "/// ") %>%
  mutate(driver_mutations = gsub(" .*", "", driver_mutations)) %>%
  # unique for samples with multiple mutations in same gene
  unique() %>%
  # count number of samples for each gene and sort by that
  group_by(driver_mutations) %>%
  mutate(n_samples = n()) %>%
  arrange(desc(n_samples)) %>%
  mutate(value = 1)

sample_mutation_matrix <- sample_mutation_matrix %>%
  mutate(
    driver_mutations = factor(
      driver_mutations,
      levels = unique(sample_mutation_matrix$driver_mutations),
      ordered = T
    ),
    Sample = factor(
      Sample,
      level = unique(sample_mutation_matrix$Sample),
      ordered = T
    )
  )

# mutations
p1 <- sample_mutation_matrix %>%
  ggplot(aes(x = driver_mutations, y = Sample)) +
  geom_tile(fill = "darkred") +
  theme_bw() +
  theme(axis.text.x = element_text(
    angle = 90,
    vjust = 0.5,
    hjust = 1
  )) +
  theme(panel.grid = element_blank())

# rearrangements
p2 <- sample_mutation_matrix %>%
  ggplot(aes(x = `Common translocation`, y = Sample)) +
  geom_tile(fill = "darkred") +
  theme_bw() +
  theme(axis.text.x = element_text(
    angle = 90,
    vjust = 0.5,
    hjust = 1
  )) +
  theme(
    panel.grid = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  ylab("")

ggpubr::ggarrange(
  p1,
  p2,
  ncol = 2,
  nrow = 1,
  align = "h",
  widths = c(4, 1)
)
```

<!-- Add driver mutations as metadata to Seurat object. -->

<!-- ```{r} -->
<!-- sample_aml_subtype_label <- sample_mutation_matrix %>% -->
<!--   group_by(Sample) %>% -->
<!--   mutate(aml_subtype = paste0(driver_mutations, collapse = ",")) %>% -->
<!--   mutate(aml_subtype = gsub(",NA", "", paste(aml_subtype, `Common translocation`, sep = ","))) %>% -->
<!--   select(Sample, aml_subtype) %>% -->
<!--   unique() -->

<!-- tmp <- -->
<!--   merge(rownames_to_column(galen_aml_donors@meta.data), -->
<!--         sample_aml_subtype_label, -->
<!--         by = "Sample") -->
<!-- tmp <- column_to_rownames(tmp, "rowname") -->

<!-- galen_aml_donors@meta.data$DriverMutations <- -->
<!--   tmp[colnames(galen_aml_donors),]$aml_subtype -->
<!-- ``` -->

### Cell type composition

Cell types present in data.

```{r}
unique(galen_aml_donors$CellType)
```

New metadata column whether cell is a cancer celltype or healthy.

Color scheme to avoid rainbow colors.

```{r}
cell_types <- unique(galen_aml_donors@meta.data$CellType)
cell_type_colors <-
  sample(chameleon::distinct_colors(n = length(cell_types))$name,
         size = length(cell_types))
cell_type_colors <- setNames(cell_type_colors, cell_types)
```

Create more "coarse" cell type labels. 
Lymphoid and erythroid (non-malignant only), and Myeloid/HSC, Myeloid/HSC-like. 

```{r}
lymphoid_celltypes <- c("Plasma",
                        "NK",
                        "ProB",
                        "B",
                        "T",
                        "CTL",
                        "pDC")
```

```{r}
galen_aml_donors@meta.data <- galen_aml_donors@meta.data %>%
  mutate(IsCancerCelltype = ifelse(grepl("-like", CellType), T, F)) %>%
  mutate(CellTypeGeneral = ifelse(
    CellType %in% lymphoid_celltypes,
    "Lymphoid",
    ifelse(
      IsCancerCelltype,
      "Myeloid/HSPC-like",
      ifelse(grepl("Ery", CellType), "Erythroid", "Myeloid/HSPC")
    )
  ))
```

Visualize cell type proportions across samples. 


```{r fig.height=4, fig.width=8}
galen_aml_donors@meta.data %>%
  group_by(orig.ident) %>%
  count(CellType) %>%
  ggplot(aes(fill = CellType, y = n, x = orig.ident)) +
  geom_bar(position = "fill", stat = "identity") +
  scale_fill_manual(values = cell_type_colors[unique(galen_aml_donors@meta.data$CellType)]) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

galen_aml_donors@meta.data %>%
  group_by(orig.ident) %>%
  count(IsCancerCelltype) %>%
  ggplot(aes(fill = IsCancerCelltype, y = n, x = orig.ident)) +
  geom_bar(position = "fill", stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

galen_aml_donors@meta.data %>%
  group_by(orig.ident) %>%
  count(PredictionRefined) %>%
  ggplot(aes(fill = PredictionRefined, y = n, x = orig.ident)) +
  geom_bar(position = "fill", stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

galen_aml_donors@meta.data %>%
  group_by(orig.ident) %>%
  count(CellTypeGeneral) %>%
  ggplot(aes(fill = CellTypeGeneral, y = n, x = orig.ident)) +
  geom_bar(position = "fill", stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())
```

Myeloid and HSPCs are the cells that are difficult to distinguish between healthy and malignant in AML cells. 
This is the actual difficult prediction task of the classifier. 
Visualize the proportion of these cells in the samples. 

```{r fig.height=4, fig.width=8}
galen_aml_donors@meta.data %>%
  filter(CellTypeGeneral %in% c("Myeloid/HSPC", "Myeloid/HSPC-like")) %>%
  group_by(orig.ident, CellTypeGeneral) %>%
  summarize(n_cells = n()) %>%
  ggplot(aes(x = orig.ident, y = n_cells, fill = CellTypeGeneral)) +
  geom_bar(stat = "identity", position = "fill") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())
```

Proportion of cycling cells is not higher in AML samples at diagnosis than in healthy samples.
AML is a disease of differentiation. 
The accumulation of blasts in the bone marrow is the problem, not hyper proliferation, as in other cancers. 

```{r fig.height=4, fig.width=8}
galen_aml_donors@meta.data %>%
  filter(`Days from diagnosis` == "D0" |
           SampleClass == "HealthyDonor") %>%
  group_by(orig.ident, SampleClass) %>%
  count(CyclingBinary) %>%
  ggplot(aes(fill = CyclingBinary, y = n, x = orig.ident)) +
  geom_bar(position = "fill", stat = "identity") +
  facet_grid( ~ SampleClass, scales = "free_x", space = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), panel.grid = element_blank())
```


### Genotype information

Classify cells based on their genotype information into malignant and healthy cells. 

I classify all cells with a mutant transcript detected as malignant cells.  
All other cells from AML samples with a wt allele detected could be classified as healthy.  
However, this is imperfect, especially for heterozygous dominant negative mutations.  
Furthermore, due to clonality, there might be cells with wt allele detected and mut allele missed.  
Both can lead to false positive healthy cells. 

Therefore, only using cells from AML donors genotyped as mutant and healthy cells from healthy donors in the classifier (as done by van Galen).

"The second classifier is used for determining if a cell for which we did not detect a mutant transcript is malignant or normal, based on its similarity to normal and malignant cells (i.e., cells from healthy BM and HSC to myeloid-like cells from tumor samples for which we detected mutant transcripts)."

```{r}
galen_aml_donors@meta.data <- galen_aml_donors@meta.data %>%
  mutate(genotype = ifelse(
    !is.na(MutTranscripts) &
      MutTranscripts != "",
    "malignant",
    ifelse(
      !is.na(WtTranscripts) &
        WtTranscripts != "",
      "healthy",
      "not_detected"
    )
  ))
```

Plot 1: Cells with mut detected, only wt detected or no coverage, compared to expected blast count.

Plot 2: Cells usable to trian malignant vs. healthy classifier. 

There are multiple samples and patients with no or almost no malignant genotyped cells that can be used for training.  
Therefore, the classifier needs to be generalizable to other samples and mutations.  
Overfitting to specific patients would be bad.

```{r}
p1 <- galen_aml_donors@meta.data %>%
  group_by(orig.ident, genotype) %>%
  summarise(n_cells = n()) %>%
  ungroup() %>%
  ggplot(aes(fill = genotype, y = n_cells, x = orig.ident)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(axis.text.x = element_blank(),
        panel.grid = element_blank()) +
  scale_x_discrete(drop = FALSE) +
  xlab(element_blank())

p2 <- galen_aml_donors@meta.data %>%
  select(`Blast count`, orig.ident) %>%
  unique() %>%
  ggplot(aes(y = `Blast count`, x = orig.ident)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank()) +
  scale_x_discrete(drop = FALSE) +
  xlab(element_blank())

ggpubr::ggarrange(
  p1,
  p2,
  align = "v",
  heights = c(2, 1),
  ncol = 1,
  nrow = 2
)

p1 <- galen_aml_donors@meta.data %>%
  mutate(Sample = as.factor(Sample)) %>%
  filter(genotype == "malignant" |
           SampleClass == "HealthyDonor") %>%
  group_by(Sample, genotype) %>%
  summarise(n_cells = n()) %>%
  ungroup() %>%
  ggplot(aes(fill = genotype, y = n_cells, x = Sample)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  scale_x_discrete(drop = FALSE) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank()) + scale_x_discrete(drop = FALSE)

p2 <- galen_aml_donors@meta.data %>%
  filter(genotype == "malignant" |
           SampleClass == "HealthyDonor") %>%
  group_by(orig.ident, genotype) %>%
  summarise(n_cells = n()) %>%
  ungroup() %>%
  ggplot(aes(fill = genotype, y = n_cells, x = orig.ident)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  scale_x_discrete(drop = FALSE) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank()) + scale_x_discrete(drop = FALSE)

ggpubr::ggarrange(
  p1,
  p2,
  align = "v",
  heights = c(1, 1),
  ncol = 1,
  nrow = 2
)
```

Concordance of genotype and van Galen celltype annotation in AML samples. 
NB: Cells with only wt transcript detected ("healthy") in AML samples might not be actually healthy (heterozygosity, clonality). 

```{r}
galen_aml_donors_metadata <- galen_aml_donors@meta.data
table(
    filter(galen_aml_donors_metadata, SampleClass != "HealthyDonor")$genotype,
    filter(galen_aml_donors_metadata, SampleClass != "HealthyDonor")$CellType
  )
```


<!-- Heatmap of number of cells from AML donors -->

<!-- ```{r} -->
<!-- galen_aml_donors_metadata <- galen_aml_donors@meta.data -->
<!-- heatmap( -->
<!--   table( -->
<!--     filter(galen_aml_donors_metadata, SampleClass != "HealthyDonor")$genotype, -->
<!--     filter(galen_aml_donors_metadata, SampleClass != "HealthyDonor")$CellType -->
<!--   ), -->
<!--   scale = NULL, -->
<!--   Colv = NA, -->
<!--   Rowv = NA -->
<!-- ) -->
<!-- ``` -->



Sex bias in cells that can be used to train classifier (mut transcript detected from AML donor, or from healthy donor). 

```{r fig.height=3, fig.width=6}
p1 <- galen_aml_donors@meta.data %>%
  filter(genotype == "malignant" |
           SampleClass == "HealthyDonor") %>%
  mutate(class = ifelse(genotype == "malignant", "malignant", "healthy")) %>%
  group_by(Gender, class) %>%
  summarise(n_cells = n()) %>%
  ungroup() %>%
  ggplot(aes(x = class, y = n_cells, fill = Gender)) +
  geom_bar(position = "fill", stat = "identity") +
  # geom_bar(stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

p2 <- galen_aml_donors@meta.data %>%
  filter(genotype == "malignant" |
           SampleClass == "HealthyDonor") %>%
  mutate(class = ifelse(genotype == "malignant", "malignant", "healthy")) %>%
  group_by(Gender, class) %>%
  summarise(n_cells = n()) %>%
  ungroup() %>%
  ggplot(aes(x = Gender, y = n_cells, fill = class)) +
  geom_bar(position = "fill", stat = "identity") +
  # geom_bar(stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

p1 | p2
```


### Accuracy of the classifier

Check the accuracy of the van Galen classifier based on concordance with clinical blast count. 
This has already been checked in the paper and the answer is that the correlation is high. 

```{r}
# samples with unclear prediction
galen_aml_donors@meta.data %>%
  filter(PredictionRefined == "unclear") %>%
  pull(orig.ident) %>%
  unique() %>% as.character()

pred_malignant_vs_blast_count <- galen_aml_donors@meta.data %>%
  group_by(
    orig.ident,
    PredictionRefined,
    `Blast count`,
    `Cell number`,
    Sample,
    Gender,
    # DriverMutations,
    `Days from diagnosis`
  ) %>%
  summarize(PredictionRefinedProportion = n() / `Cell number`) %>%
  filter(PredictionRefined == "malignant") %>%
  select(
    `Blast count`,
    PredictionRefinedProportion,
    orig.ident,
    Sample,
    Gender,
    # DriverMutations,
    `Days from diagnosis`
  ) %>%
  unique() %>%
  mutate(`Blast count` = as.numeric(`Blast count`)) %>%
  ungroup()

nrow(pred_malignant_vs_blast_count)

pred_malignant_vs_blast_count %>%
  ggplot(aes(
    x = PredictionRefinedProportion,
    y = `Blast count`,
    group = Sample,
    color = Sample
  )) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  ggrepel::geom_text_repel(aes(label = orig.ident),
                           size = 3,
                           max.overlaps = Inf) +
  theme_bw() +
  coord_equal() +
  theme(panel.grid = element_blank())
```



```{r}
cor.test(
  pred_malignant_vs_blast_count$PredictionRefinedProportion,
  pred_malignant_vs_blast_count$`Blast count`
)
```

If the sex bias in healthy and AML samples matters, then the difference between predicted and observed blast count (error) should be larger in females than males. 

Predicted proportion of malignant cells agrees less with clinical blast count for female samples.
Markedly, female samples have an overestimated proportion of malignant cell. 

However, this could also be due to other factors covarying with Sex, like mutated gene or blast count.

Median absolute error 10.8%(F), 3.1%(M).

```{r fig.height=3, fig.width=4}
pred_malignant_vs_blast_count_error <- pred_malignant_vs_blast_count %>%
  mutate(error = PredictionRefinedProportion - `Blast count`,
         absolute_error = abs(error),
         relative_error = PredictionRefinedProportion / `Blast count`) %>%
  pivot_longer(
    cols = c(error, absolute_error, relative_error),
    names_to = "error_type",
    values_to = "error"
  ) %>%
  group_by(error_type, Gender) 

pred_malignant_vs_blast_count_error %>%
  summarize(median_error = round(median(error), 3))
```


```{r fig.height=3, fig.width=4}
symmetric_limits <- function (x)
{
  max <- max(abs(x))
  c(-max, max)
}

p1 <- pred_malignant_vs_blast_count_error %>%
  filter(error_type %in% c("error", "absolute_error")) %>%
  ggplot(aes(x = error_type, y = error * 100, fill = Gender)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_boxplot(outlier.size = 0) +
  geom_point(pch = 21, position = position_jitterdodge(jitter.width = 0.1)) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  theme_bw() +
  theme(panel.grid = element_blank(), legend.position = "None") +
  ylab("Prediction error % blasts") +
  scale_y_continuous(limits = symmetric_limits) +
  xlab(element_blank()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p2 <- pred_malignant_vs_blast_count_error %>%
  filter(error_type %in% c("relative_error")) %>%
  ggplot(aes(x = error_type, y = log2(error), fill = Gender)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_boxplot(outlier.size = 0) +
  geom_point(pch = 21, position = position_jitterdodge(jitter.width = 0.1)) +
  theme_bw() +
  theme(panel.grid = element_blank()) +
  ylab("log2 relative error % blasts") +
  scale_y_continuous(limits = symmetric_limits) +
  xlab(element_blank()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggpubr::ggarrange(p1, p2, align = "h", widths = c(1, 1.2))
```


<!-- Residuals of linear model.  -->

<!-- ```{r fig.height=3, fig.width=3.5} -->
<!-- model <- lm(PredictionRefinedProportion ~ `Blast count`, data=pred_malignant_vs_blast_count)  -->

<!-- tmp <- pred_malignant_vs_blast_count %>% -->
<!--   dplyr::mutate( -->
<!--     fits = fitted(model), -->
<!--     resids = resid(model), -->
<!--     sresids = rstudent(model) -->
<!--   ) -->

<!-- tmp %>% -->
<!--   group_by(Gender) %>% -->
<!--   summarize(sum(abs(resids)), mean(abs(resids)), median(abs(resids))) -->

<!-- #create residual plot -->
<!-- ggplot(tmp, aes(x = fits, y = resids, color = Gender)) + -->
<!--   ggrepel::geom_text_repel(aes(label = paste(Sample, , sep = "|")), max.overlaps = Inf) + -->
<!--   geom_hline(yintercept = 0, linetype = 2) + -->
<!--   geom_point() + -->
<!--   theme_bw() -->
<!-- ``` -->



### Data quality

According to manuscript, cells were filtered for >1000 UMI, >500 genes and <20% mitochondrial+ribosomal transcripts.

The Rdata object contains filtered cells. 

```{r fig.width=20, fig.height=10}
VlnPlot(galen_aml_donors, features = c("nCount_RNA", "nFeature_RNA"), ncol = 1, log = T, pt.size = 0)
```

<!-- ```{r fig.height=4, fig.width=5} -->
<!-- galen_aml_donors@meta.data %>% -->
<!--   group_by(orig.ident, `Cell number`, SampleClass) %>% -->
<!--   dplyr::summarise(median_nFeature_RNA = median(nFeature_RNA)) %>% -->
<!--   ggplot(aes(x = `Cell number`, y = median_nFeature_RNA)) + -->
<!--   geom_point(aes(color = SampleClass)) + -->
<!--   geom_smooth(method = "lm") + -->
<!--   theme_bw() + -->
<!--   theme(panel.grid = element_blank()) -->
<!-- ``` -->


### Low dimensional embedding

#### AML and healthy donors

Normalize data by library size and log transform.  
Identify 2000 highly variable genes.  
Scale/z-score highly variable genes.  

```{r results=F}
galen_aml_donors <- NormalizeData(galen_aml_donors, scale.factor = 10000)
galen_aml_donors <- FindVariableFeatures(galen_aml_donors)
galen_aml_donors <- ScaleData(galen_aml_donors)
```

Calculate PCA and check elbow plot for appropriate number of PCs for UMAP embedding. 

```{r results=F}
galen_aml_donors <- RunPCA(galen_aml_donors)
```


```{r}
ElbowPlot(galen_aml_donors, ndims = 50) +
  coord_cartesian(ylim = c(0, NA))
```

Calculate UMAP embedding. 

```{r results=F}
galen_aml_donors <- RunUMAP(galen_aml_donors, dims = 1:15)
```

Check UMAP for technical and biological variation.

Higher read counts in erythrocyte progenitors.


```{r fig.height=6, fig.width=7}
# technical
FeaturePlot(galen_aml_donors, features = "nCount_RNA", pt.size = 0.005, max.cutoff = "q99") + coord_equal()
FeaturePlot(galen_aml_donors, features = "nFeature_RNA", pt.size = 0.005) + coord_equal()
```


```{r fig.height=6, fig.width=7}
# annotation biological
DimPlot(galen_aml_donors, group.by = "SampleClass", shuffle = T, pt.size = 0.005) + coord_equal()
DimPlot(galen_aml_donors, group.by = "IsCancerCelltype", shuffle = T, pt.size = 0.005) + coord_equal()
DimPlot(galen_aml_donors, group.by = "PredictionRefined", shuffle = T, pt.size = 0.005) + coord_equal()
DimPlot(galen_aml_donors, group.by = "CellType", shuffle = T, pt.size = 0.005, label = T, label.size = 3) + coord_equal()
DimPlot(galen_aml_donors, group.by = "orig.ident", shuffle = T, pt.size = 0.005, label = T, label.size = 2) + coord_equal() + NoLegend()
FeaturePlot(galen_aml_donors, features = "Age", pt.size = 0.005) + coord_equal()
DimPlot(subset(galen_aml_donors, SampleClass == "HealthyDonor"), group.by = "orig.ident", shuffle = T, pt.size = 0.005) + coord_equal()
```


```{r fig.height=6, fig.width=7}
# sex differences
DimPlot(galen_aml_donors, group.by = "Gender", shuffle = T, pt.size = 0.005) + coord_equal()
DimPlot(subset(galen_aml_donors, SampleClass == "AmlDonor"), group.by = "Gender", shuffle = T, pt.size = 0.005) + coord_equal()
DimPlot(subset(galen_aml_donors, SampleClass == "AmlDonor"), group.by = "Sample", shuffle = T, pt.size = 0.005, split.by = "Gender") + coord_equal()
DimPlot(galen_aml_donors, group.by = "Sample", shuffle = T, pt.size = 0.005, split.by = "Gender") + coord_equal()
```

#### Embedding healthy donors

Create an embedding of just the healthy samples to visualize the normal hematopoietic trajectory.  
Same procedure as above. 

```{r}
galen_aml_donors_healthy <- subset(galen_aml_donors, SampleClass == "HealthyDonor")
```

```{r results=F}
galen_aml_donors_healthy <- FindVariableFeatures(galen_aml_donors_healthy)
galen_aml_donors_healthy <- ScaleData(galen_aml_donors_healthy)
```

```{r results=F}
galen_aml_donors_healthy <- RunPCA(galen_aml_donors_healthy)
```

```{r}
ElbowPlot(galen_aml_donors_healthy, ndims = 50) +
  coord_cartesian(ylim = c(0, NA))
```

```{r results=F}
galen_aml_donors_healthy <- RunUMAP(galen_aml_donors_healthy, dims = 1:15)
```


```{r fig.height=5, fig.width=7}
DimPlot(galen_aml_donors_healthy, group.by = "CellType", shuffle = T, pt.size = 0.01, label = T, label.size = 3) +
  coord_equal()
DimPlot(galen_aml_donors_healthy, group.by = "orig.ident", shuffle = T, pt.size = 0.01, label = T, label.size = 3) +
  coord_equal() + NoLegend()
```


<!-- Check how reliable UMAP is, i.e. how well it captures the PC space, using the default UMAP parameters. -->

<!-- ```{r} -->
<!-- library(scDEED) -->
<!-- K = 8 -->
<!-- result = scDEED(galen_aml_donors_healthy, K = K, reduction.method = 'umap', n_neighbors = 30, min.dist = 0.3, rerun = F) -->
<!-- ``` -->

<!-- ```{r} -->
<!-- dubious_cells = result$full_results$dubious_cells[result$full_results$n_neighbors == '30'] -->
<!-- dubious_cells = as.numeric(strsplit(dubious_cells, ',')[[1]]) -->
<!-- trustworthy_cells =  result$full_results$trustworthy_cells[result$full_results$n_neighbors == '30'] -->
<!-- trustworthy_cells = as.numeric(strsplit(trustworthy_cells, ',')[[1]]) -->

<!-- DimPlot( -->
<!--   galen_aml_donors_healthy, -->
<!--   reduction = 'umap', -->
<!--   cells.highlight = list('dubious' = dubious_cells, 'trustworthy' = trustworthy_cells) -->
<!-- ) + scale_color_manual(values = c('gray', 'blue', 'red')) + coord_equal() -->
<!-- ``` -->

#### AML and healthy donors excluding sex-specific genes

SDE scores for all protein-coding genes in 45 tissues common to men and women. 
Genes were analyzed by NOISeqBIO with scores of zero given for genes with insignificant differential expression. 
Other genes have SDE scores below zero for men-biased expression and above zero for women-biased expression. (CSV 2205 kb)

Bone marrow is not available. Instead use whole blood and spleen.

```{r}
sex_genes <- read_csv("gershoni_et_al/12915_2017_352_MOESM3_ESM.csv")
sex_genes <- sex_genes %>% 
  select(Gene, Spleen, Whole_Blood) %>%
  filter(Spleen != 0 | Whole_Blood != 0) %>%
  mutate(gene = gsub("_.*", "", Gene)) %>%
  pull(gene)

sex_genes <- c(sex_genes, "XIST")

# make sure IDs match
sex_genes[!sex_genes %in% rownames(galen_aml_donors)]

length(sex_genes)
```

Also exclude genes expressed on Y chromosome.
149 genes present in scRNA-seq data.

```{r}
y_chrom_genes <- read_lines("reference_sources/gencode.v44_gene_names_chrY.txt")
y_chrom_genes <- y_chrom_genes[y_chrom_genes %in% rownames(galen_aml_donors)]
length(y_chrom_genes)

sex_genes <- unique(c(sex_genes, y_chrom_genes))
```

6 out of 2000 highly variable genes (AML+healthy) have sex-specific expression.

I don't think removal of all genes that are differentially expressed betwen M and F should be remove, because there variables that co-vary with sex. 

```{r}
# sex-specific genes used for embedding
sex_genes[sex_genes %in% VariableFeatures(galen_aml_donors)]
```

Re-calculate PCA, excluding sex-specific genes from HVG list.

```{r results=F}
galen_aml_donors <-
  RunPCA(
    galen_aml_donors,
    features = VariableFeatures(galen_aml_donors)[!VariableFeatures(galen_aml_donors) %in% sex_genes],
    reduction.name = "pca_no_sex"
  )
```


```{r}
ElbowPlot(galen_aml_donors, ndims = 50, reduction = "pca_no_sex") +
  coord_cartesian(ylim = c(0, NA))
```

```{r results=F}
galen_aml_donors <-
  RunUMAP(
    galen_aml_donors,
    dims = 1:15,
    reduction = "pca_no_sex",
    reduction.name = "umap_no_sex"
  )
```

I would check if the integration of sexes has improved if CalcAlignmentMetric was still available in Seurat v5...

MixingMetric has replaced CalcAlignmentMetric (Seurat v2).


Explanation of [MixingMetric](https://github.com/satijalab/Integration2019/issues/1)

> 1. Look at the k.max nearest neighbors for a cell across all datasets.
  2. Compute the k closest neighbors for each dataset individually for the same cell.
  3. Find the rank in the overall neighborhood list (from 1.) that corresponds to the kth neighbor in each dataset (max of k.max) and take the median across all datasets.
  4. Compute 1-3 for every cell and average.
  This average is the value that is returned by the MixingMetric function. However, it's usually more intuitive to think of high values of "mixing" to be better so for visualization, we often plot max.k - MixingMetric.

Deafult `max.k = 300`.

According to this metric, mixing of cells from male and female AML donors has *slightly* improved. 

```{r results = F}
mixing_aml_sexes <- MixingMetric(
  subset(galen_aml_donors, SampleClass == "AmlDonor"),
  reduction = "pca",
  dims = 1:15,
  grouping.var = "Gender"
)

mixing_aml_sexes_corrected <- MixingMetric(
  subset(galen_aml_donors, SampleClass == "AmlDonor"),
  reduction = "pca_no_sex",
  dims = 1:15,
  grouping.var = "Gender"
)
```


```{r}
300 - mean(mixing_aml_sexes)
300 - mean(mixing_aml_sexes_corrected)
```



```{r fig.height=6, fig.width=7}
# sex differences
DimPlot(
  subset(galen_aml_donors, SampleClass == "AmlDonor"),
  reduction = "umap_no_sex",
  group.by = "Gender",
  shuffle = T,
  pt.size = 0.005
) + coord_equal()
```

## Task 2: ML classifier malignant cells

**Task Description**

An interesting feature of the dataset is the availability of labels distinguishing malignant and non-malignant single cells. 
The second task is to exploit these labels to build a suitable ML classifier that predicts the normal or malignant identity of single cells. 
Please detail on how your classifier works, whether it is interpretable, its performance, and the train-test split you use for your machine learning experiment.  

### Considerations for the task

The tricky part is not training a classifier, but to consider biology and technology to train a *good* classifier. 

I assume that "labels distinguishing malignant and non-malignant single cells" in the task description refers to the genotyped cells, and not the labels predicted by van Galen et al. based on their random forest classifiers.  

Van Galen et al. train two random forest classifiers (see below), in order to classify cells from AML donors into malignant and normal cells.  
The first classifier is for HSPC cell types and the second for HSPC cell types + malignant vs. normal. 
Each classifier consists of an "outer" classifier for feature selection and an "inner" classifier that is used for prediction. 
Due to very limited number of genotyped cells that can be used for training, they do not hold out a test set. 
Instead, to evaluate their model performance, they perform 5-fold cross validation of the "inner" classifier.  
Finally, they use the model trained on all available data for classification of AML cells that were not genotyped.  

I assume that at the time there was no independent dataset available with genotyped AML cells to test for generalizability of the model to other scRNA-seq technologies (droplet-based, 5' sequencing), human populations, or driver mutations. 

**Points of critique**

1. Including all training data in the feature selection makes the CV results of the "inner" classifier less informative.
2. The CV is supposed to evaluate over- and underfitting.
    To estimate overfitting (complexity of the model), the model should be evaluated on how well it extrapolates to another donor and to another driver mutation, not a random sample of all training cells.
3. All healthy cells in the training data are male. Van Galen do not address this by removing sex-specific genes from the features. 
    (I'm honestly suprised that the classifier doesn't perform even worse on female samples based on blast count than it does. 
    A male sample with loss of y might play a role, and expression level/sparsity of sex-specific genes.)
4. Cells were genotyped based on transcripts of following genes:  
    TET2, SETD2, PTPN11, NRAS, SF3A1, BCOR, KIT, RAD21, TP53, BRCC3, RUNX1, FLT3, IDH2, DNMT3A, SMC3, NPM1, KRAS.  
    Cells with higher expression of these genes are more likely to be genotyped and make it into the training data.
    Their expression should be excluded from the feature list, also to make it more generalisable to other driver mutations. 
5. The classifier does not consider possible differences in transcription between non-malignant HSPCs from AML donors and healthy donors. 
    Non-malignant cells from AML donors have possibly altered transcription due to changes in the environment in the bone marrow.  
    The classifier only considers healthy cells from healthy donors, and not non-malignant cells from AML donors. 
    This is a limitation of the technology, because cells with wt transcript detected but no mutated transcript detected cannot confidently be classified as normal, due to clonality and heterozygosity.  
    However, there might be a subset of cells from donors with known zygosity and clonal structure where this is possible. 

What I agree with, is the observation that, generally, malignant HSCs are transcriptionally more similar to healthy HSCs than to malignant monocyte progenitors.
Therefore, it makes sense to guide the model with information on similarity towards healthy HSPC cell types.  

Random forest classifiers allow the extraction of feature importance metrics (e.g. number of trees in which a feature is used).
However, one has to differ between a *minimal optimal subset* and an *all relevant subset* of genes. 
This is especially relevant for gene-expression based classifiers due to redundance (think gene modules/signatures).
See [Kursa et al. 2014](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-8) for more details. 

Van Galen et al. select as features the genes chosen most often in the outer classifier, across 1000 trees. This does not seem to consider classes. 

**Ideas**

- Instead of predicting HSC and HSC-like cells, why not give a similarity score towards each healthy cell type as input feature. 
    This would allow to inform the classifier for relevant features like combinations of signatures (e.g. LSCs arising from GMPs with mixed HSC/GMP signature).
- Why include lymphoid and erythroid cell types in the malignant vs. non-malignant classifier? 
    They are transcriptionally very distinct and in AML they are not malignant. 
    The classifier might have an easier time if they are excluded. 
- Consider other methods to assign the most similar healthy cell type, such as label transfer via projection onto the healthy reference (Azimuth). 
- Use an independent test set of genotypes AML cells. 

**Van Galen classifier malignant vs. normal AML cells**

```{r echo = F, out.width="100%", fig.width=15}
nomnoml::nomnoml(code = "#direction: down
[<frame>HSPC celltype annotation |
[Healthy cells] - 
[<state> Backspin clustering | Manual HSPC cell type annotation] - 
[Healthy cells, annotated by HSPC cell types] - [<state> Train RF] - 
[Classifier for HSPC cell types] - [<state> Predict]
[AML cells] - [<state> Predict] -
[AML cells, annotated by HSPC cell types]
]

[<frame>Malignant celltype annotation |
[AML cells, genotyped malignant | HSPC-like cell type labels] - [<state> Train RF]
[Healthy cells | HSPC cell type labels] - [<state> Train RF] -
[Classifier for HSPC & HSPC-like cell types] - [<state> Predict]
[AML cells, not genotyped malignant | HSPC-like cell type labels] - [<state> Predict] -
[AML cells, annotated by HSPC & HSPC-like cell types]
]")
```


### Transcriptional heterogeneity of malignant and healthy cells

The goal is to develop a classifier of malignant vs. healthy HSPCs and myeloid cells from AML patients, that is generalizable to other AML types with other driver mutations and other scRNA-seq technologies/datasets. 

Evaluate the transcriptional heterogeneity of AML subtypes and malignant vs. healthy cells.

Q1: Do healthy cells differ from malignant cells based on their transcriptome?

Q2: Is the malignant transcriptional profile specific to driver mutations?

We are only interested in myeloid and HSPC cells here, because distinction of erythroid and lymphoid cells is relatively easy. 
Those cells are not malignant in acute *myeloid* leukemia. 

For a start, use HSPCs + myeloid cells, according to van Galen cell type annotation.  

<!-- Only consider samples with blast count > 0.1 and at time of diagnosis.  -->
<!-- These were 2 variables that accounted for most of the variance in the data.  -->

Include healthy donors, to see if the cells genotyped as wt group with definitive healthy cells. 

```{r}
selected_samples <- donor_metadata %>%
  # filter(
  #   !grepl("BM", Sample),
  #   # `Blast count` > 0.1,
  #   # `Days from diagnosis` == "D0"
  #   ) %>%
  pull(SampleId)

galen_aml_donors_myeloid_hspc <-
  subset(
    galen_aml_donors,
    subset = (orig.ident %in% selected_samples & SampleClass == "HealthyDonor" | 
      ((CellTypeGeneral == "Myeloid/HSPC-like" |
      CellTypeGeneral == "Myeloid/HSPC") & genotype != "not_detected"))
  )
table(galen_aml_donors_myeloid_hspc$genotype)

# only keep SampleId-genotype combinations with at least 10 cells
samples_enough_cells <- names(which(table(galen_aml_donors_myeloid_hspc$orig.ident) >= 10))
galen_aml_donors_myeloid_hspc <-
  subset(
    galen_aml_donors_myeloid_hspc,
    subset = orig.ident %in% samples_enough_cells
  )
```

```{r}
galen_aml_donors_myeloid_hspc@meta.data <- galen_aml_donors_myeloid_hspc@meta.data %>%
  mutate(
    SampleId_genotype =
      paste(
        galen_aml_donors_myeloid_hspc$orig.ident,
        galen_aml_donors_myeloid_hspc$genotype,
        sep = "_"
      )
  ) %>%
  mutate(SampleId_genotype = ifelse(SampleClass == "HealthyDonor", gsub("not_detected", "healthy", SampleId_genotype), SampleId_genotype))
```


```{r}
pb_counts <-
  Seurat::AggregateExpression(
    galen_aml_donors_myeloid_hspc,
    assays = "RNA",
    slot = "counts",
    # group.by = "SampleId_CellTypeGeneral"
    group.by = "SampleId_genotype"
    # group.by = "orig.ident"
  )$RNA

group_anno <- colnames(pb_counts)
group_anno <- gsub(".*-", "", group_anno)
 
pb_dge <- edgeR::DGEList(
    counts = pb_counts,
    samples = colnames(pb_counts),
    group = group_anno
)
```

Filter out samples with low read count (=few filtered cells).  
Keep samples with read counts > 50,000, which is all samples.

```{r}
summary(pb_dge$samples$lib.size)

keep.samples <- pb_dge$samples$lib.size > 5e4
table(keep.samples)

# pb_dge <- pb_dge[, keep.samples]
```

Filter out lowly expressed genes.  
Keep genes with at least 10 CPM in at least 2 samples.
 
```{r}
# genes with at least 1 CPM in at least 2 samples
keep.genes <- rowSums(edgeR::cpm(pb_dge) >=10) >= 2
table(keep.genes)
 
pb_dge <- pb_dge[keep.genes, , keep=FALSE]
```

TMM normalization is performed to estimate effective library size.
 
```{r}
pb_dge <- edgeR::calcNormFactors(pb_dge, method = "TMM")
```
 
Calculate Multidimensional scaling (MDS) plot of distances between gene expression profiles.
 
```{r}
pb_mds <- limma::plotMDS(pb_dge, plot = F)

pb_mds_df <- data.frame(
  list(
    x = pb_mds$x,
    y = pb_mds$y,
    lib.size = pb_dge$samples$lib.size,
    sample = gsub("-.*", "", colnames(pb_dge)),
    group = pb_dge$samples$group
  )
) %>%
  merge(donor_metadata, by.x = "sample", by.y = "SampleId", all.x = T)
```

Check that major source of variation is not technical (sample size) or irrelevant biological information in this case (blast count).

```{r fig.height=3.5, fig.width=5}
xlab_text <- paste0(
    "Leading logFC dim 1 (",
    round(pb_mds$var.explained[1] * 100, 1),
    "%)"
  )
ylab_text <- paste0(
    "Leading logFC dim 2 (",
    round(pb_mds$var.explained[2] * 100, 1),
    "%)"
  )

mds_theme <- theme_bw() +
  theme(panel.grid = element_blank())

pb_mds_df %>%
  unique() %>%
  ggplot(aes(x, y, label = sample)) +
  geom_point(aes(color = log(lib.size))) +
  # ggrepel::geom_text_repel() +
  xlab(xlab_text) + ylab(ylab_text) +
  mds_theme

pb_mds_df %>%
  ggplot(aes(x, y, label = sample)) +
  geom_point(aes(color = `Blast count`)) +
  # ggrepel::geom_text_repel() +
  xlab(xlab_text) + ylab(ylab_text) +
  mds_theme

pb_mds_df %>%
  ggplot(aes(x, y, label = sample)) +
  geom_point(aes(color = `Days from diagnosis` == "D0")) +
  # ggrepel::geom_text_repel() +
  xlab(xlab_text) + ylab(ylab_text) +
  mds_theme

pb_mds_df %>%
  ggplot(aes(x, y, label = sample)) +
  geom_point(aes(color = group)) +
  # ggrepel::geom_text_repel() +
  xlab(xlab_text) + ylab(ylab_text) +
  mds_theme

pb_mds_df %>%
  ggplot(aes(x, y, label = sample)) +
  geom_point(aes(color = Gender)) +
  # ggrepel::geom_text_repel() +
  xlab(xlab_text) + ylab(ylab_text) +
  mds_theme

pb_mds_df %>%
  ggplot(aes(x, y, label = sample)) +
  geom_point(aes(color = Sample)) +
  ggrepel::geom_text_repel(
    size = 2,
    min.segment.length = 0,
    max.overlaps = Inf
  ) +
  xlab(xlab_text) + ylab(ylab_text) +
  mds_theme +
  theme(legend.position = "None")
```


Check if the samples split by mutation.

```{r}
sampleId_mutation_list <- donor_metadata %>%
  select(SampleId, `RHP Mutations`, `Common translocation`) %>%
  filter(SampleId %in% pb_mds_df$sample) %>%
  pivot_longer(cols = c(`RHP Mutations`, `Common translocation`), values_to = "driver_mutation") %>%
  select(-name) %>%
  filter(!driver_mutation %in% c("Not performed", "None Detected", "Unknown", "NA") & !is.na(driver_mutation)) %>%
  separate_longer_delim(cols = driver_mutation, "/// ") %>%
  mutate(driver_mutation = gsub(" .*", "", driver_mutation)) %>%
  # unique for samples with multiple mutations in same gene
  unique() %>%
  mutate(value = T) %>%
  pivot_wider(id_cols = SampleId, names_from = driver_mutation, values_fill = F) %>%
  pivot_longer(cols = -SampleId, names_to = "driver_mutation")
```


```{r}
pb_mds_df_driver_mutation <- merge(pb_mds_df, sampleId_mutation_list, by.x = "sample", by.y = "SampleId", all.x = T)
```

PC1 aligns with DNMTA3. Indication that DNMT3 has a distinct transcriptional profile. 

"DNMT3A mutations occur in approximately 20% of AML cases and are associated with changes in DNA methylation. CDKN2B plays an important role in the regulation of hematopoietic progenitor cells and DNMT3A mutation is associated with CDKN2B promoter methylation."
https://pmc.ncbi.nlm.nih.gov/articles/PMC7106122/

KRAS, FLT3-ITD, NPM1, NRAS

```{r, fig.height=8}
pb_mds_df_driver_mutation %>%
  filter(group == "malignant") %>%
  ggplot(aes(x, y, label = sample)) +
  geom_point(aes(color = value)) +
  facet_wrap(~driver_mutation) +
  xlab(xlab_text) + ylab(ylab_text) +
  theme_bw() +
  theme(panel.grid = element_blank()) +
  coord_equal()
```

```{r}
unique(sampleId_mutation_list$driver_mutation)
genes_genotypes <-
  c(
    "NRAS",
    "KRAS",
    "SF3A1",
    "NOTCH1",
    "NOTCH2",
    "SF3A1",
    "CBFB",
    "MYH11",
    "NPM1",
    "JAK3",
    "DNMT3A",
    "FLT3",
    "TP53",
    "SETD2",
    "RUNX1",
    "BCOR",
    "BCORL1",
    "PTPN11",
    "SMC3",
    "IDH2",
    "TET2",
    "BRCC3",
    "KIT",
    "RAD21",
    "MLL",
    "CEBPA"
  )
```

### Azimuth projection

Alternative to cell type identification with a random forest model is label transfer from a reference atlas. 

Perform label transfer from the Azimuth bone marrow reference atlas. 
Instead of training a random forest classifier to determine the most similar healthy equivalent of a cell.  

Azimuth provides cell type labels at 2 levels, see [interactive reference](https://azimuth.hubmapconsortium.org/references/human_bonemarrow/).

For this step, we are not interested in the genes distinguishing the different cell types because they are already well defined. 
So we don't need a RF classifier where we can extract feature importance metrics. 

The label transfer is very fast and easy and provides a well established reference framework of cell type classifications and hierarchies. 

Reference downloaded from [here](http://seurat.nygenome.org/src/contrib/bonemarrowref.SeuratData_1.0.0.tar.gz).

```{r}
library(Azimuth)
library(SeuratData)
```

```{r results=F, eval=F}
# The RunAzimuth function can take a Seurat object as input
# overwrites miscellanous data, so saving as new object
galen_aml_donors_azimuth <- RunAzimuth(galen_aml_donors,
                                       reference = "azimuth_reference/bonemarrowref.SeuratData/inst/azimuth/")
```

Add Azimuth labels to original seurat object.

```{r, eval=F}
galen_aml_donors <-
  AddMetaData(
    galen_aml_donors,
    galen_aml_donors_azimuth@meta.data,
    col.name = c("predicted.celltype.l1", "predicted.celltype.l2")
  )
```


The function returns cell type labels on 2 levels, and 2 QC metrics, a mapping score and a prediction score for each level.  
It also returns the projection of the cells onto the reference UMAP space.

```{r fig.height=7, eval=F}
DimPlot(galen_aml_donors_azimuth, group.by = "predicted.celltype.l1", shuffle = T, label = T) + coord_equal()
FeaturePlot(galen_aml_donors_azimuth, features = "predicted.celltype.l1.score") + coord_equal()
DimPlot(galen_aml_donors_azimuth, group.by = "predicted.celltype.l2", shuffle = T, label = T) + coord_equal()
FeaturePlot(galen_aml_donors_azimuth, features = "predicted.celltype.l2.score") + coord_equal()
FeaturePlot(galen_aml_donors_azimuth, features = "mapping.score") + coord_equal()

DimPlot(galen_aml_donors_azimuth, group.by = "SampleClass", shuffle = T) + coord_equal()
DimPlot(galen_aml_donors_azimuth, group.by = "CellType", shuffle = T, label = T) + coord_equal()
```


Compare Azimuth annotations with backspin cluster annotation, followed by RF.
Also compare with malignant/non malignant RF annotation. 

```{r fig.height=5, eval=F}
heatmap(table(galen_aml_donors$backspin_celltype, galen_aml_donors$predicted.celltype.l1), scale = NULL, Rowv = NA, Colv = NA)
heatmap(table(galen_aml_donors$backspin_celltype, galen_aml_donors$predicted.celltype.l2), scale = NULL, Rowv = NA, Colv = NA)

heatmap(table(galen_aml_donors$CellType, galen_aml_donors$predicted.celltype.l1), scale = NULL, Rowv = NA, Colv = NA)
heatmap(table(galen_aml_donors$CellType, galen_aml_donors$predicted.celltype.l2), scale = NULL, Rowv = NA, Colv = NA)
```

In the end, I realized that the Azimuth annotations are problematic, becaue either very granular or useless because lumping everything into HSPC. 


### Train classifier

backspin celltype annotation is available for 783 cells of 1,590 BM5 CD34+CD38–.

```{r}
y <- factor(galen_aml_donors_healthy$backspin_celltype)
X <- galen_aml_donors_healthy@assays$RNA@layers$data
X <- X[,!is.na(y)]
rownames(X) <- rownames(galen_aml_donors_healthy)
y <- y[!is.na(y)]

# # combine HSC and Prog
# y[y %in% c("HSC", "Prog")] <- "HSC_Prog_"
```

Van Galen et al. select genes with average normalized expression >= 0.01 (normalized to 10k reads).
This is biased by class. Instead, select genes expressed in 5% cells in at least 1 class (10141 genes).

```{r}
# mean normalized expression >= 0.01
keep.genes <- rowMeans(exp(X) - 1) >= 0.01
table(keep.genes)

# expressed in 1% cells
keep.genes.expressed <- rowSums((X > 0) / ncol(galen_aml_donors_healthy)) >= 0.01
table(keep.genes.expressed)

# expressed in 1% cells in at least 1 class
classes <- unique(y)
keep.genes.class <- lapply(classes, FUN = function(class) {
  genes <- names(which(rowSums((X[,y == class] > 0) / sum(y == class)) >= 0.05))
  return(genes)
})
keep.genes.class <- rownames(X) %in% unlist(keep.genes.class)
table(keep.genes.class)

X <- X[which(keep.genes.class),]
```


All cells are male. Remove sex-specific genes anyways. 

```{r}
keep.genes <- !rownames(X) %in% sex_genes
table(keep.genes)
X <- X[which(keep.genes),]
```


sampsize is the number of cells sampled from each class. By default `ceiling(.632*nrow(x))`.
This is one way to deal with imbalanced data in RF by using balanced data sets for each tree, even if the data is not balanced, which is made possible by bootstrap.

Train 100 trees because I'm impatient. 

```{r}
set.seed(123)
rf.celltype.outer <- randomForest(
  x = t(X),
  y = y,
  sampsize = rep(50, length(unique(y))),
  ntree = 200,
  do.trace = F
)
```

```{r}
rf.celltype.outer
```

Select 1k features from 14074 genes. 

bestvar is the variable used to split the node (0 if node is terminal).

```{r}
rf.celltype.outer.used1k <-
  names(sort(table(
    rownames(rf.celltype.outer$importance)[rf.celltype.outer$forest$bestvar[rf.celltype.outer$forest$bestvar !=
                                                                              0]]
  ), decreasing = TRUE)[1:1000])

plot(rf.celltype.outer$importance[rf.celltype.outer.used1k, 1])  # correlates to importance measure
```


```{r}
set.seed(123)
rf.celltype.inner <- randomForest(
  x = t(X[rf.celltype.outer.used1k,]),
  y = y,
  sampsize = rep(50, length(unique(y))),
  ntree = 200,
  do.trace = F
)
```

```{r}
rf.celltype.inner
```

```{r}
y_pred_healthy <- predict(rf.celltype.inner, newdata = t(X), type = "prob")
```

```{r}
rf.celltype.inner$confusion
```

earlyEry misclassified as lateEry and Prog.
Prog misclassified as HSC.

```{r}
heatmap(rf.celltype.inner$confusion, Rowv = NA, Colv = NA, scale = NULL)

rf.celltype.inner$confusion %>%
  as.data.frame() %>%
  rownames_to_column("class") %>%
  ggplot(aes(x = class, y = class.error)) +
  geom_bar(stat = "identity")
```

Predict celltype of malignant AML cells. 

"In four patients (AML314, AML371, AML722B and AML997), for which we detected few mutant transcripts and few high quality cells, we could not confidently assign malignant cells."

I also exclude AML475 and AML420B because they also only have 7 and 6 cells genotyped as malignant (and upon first classification, those cells were lymphoid cells).

```{r}
galen_aml_donors_malignant <- subset(galen_aml_donors, subset = (genotype == "malignant" & SampleClass != "HealthyDonor"))
table(galen_aml_donors_malignant$Sample)

selected_samples <-
  donor_metadata$Sample[!donor_metadata$Sample %in% c("AML314", "AML371", "AML722B", "AML997", "AML475", "AML420B")]
galen_aml_donors_malignant <-
  subset(galen_aml_donors_malignant, Sample %in% selected_samples)
```


```{r}
X_malignant <- galen_aml_donors_malignant@assays$RNA@layers$data
rownames(X_malignant) <- rownames(galen_aml_donors_malignant)
X_malignant <- X_malignant[rf.celltype.outer.used1k,]

y_pred_malignant <- predict(rf.celltype.inner, newdata = t(X_malignant), type = "prob")

y_pred_max <- colnames(y_pred_malignant)[max.col(y_pred_malignant)]
```


There are cells classified as lymphocytes, despite detection of mutated transcripts.
Same for early erythrocytes, pDC. 

```{r}
table(y_pred_max)
table(y_pred_max, galen_aml_donors_malignant$Sample)
t(round(t(
  table(y_pred_max, galen_aml_donors_malignant$Sample)
) / as.vector(colSums(
  table(y_pred_max, galen_aml_donors_malignant$Sample)
)), 2))
```


There are some rather unexpected HSPC cell type predictions: early erythroid cells and pDC cells. While lymphoid cell types are pretty surely a misclassification, early ery and pDC can actually be mutated in AML, according to literature. 

https://www.mdpi.com/2072-6694/14/14/3375
Acute myeloid leukemia (AML) with ≥2% plasmacytoid dendritic cells (pDC) has been recently described as AML with pDC differentiation (pDC-AML) characterized by pDC expansion with frequent RUNX1 mutations.

AML921A has 8% cells predicted as pDC and RUNX1 NM_001754 c.167T>C p.L56S (63.5%, VUS)

In combination with the possible signature combinations in AML cells, I want to keep those predictions. 

However, there are too few cells to use them as classes in a classifier. Instead, I'll include them as features, alongside genes, and predict malignant vs. normal classes. 

I'll only train 1 classifier, instead of inner and outer. 
This way, validation of correct prediction of malignant cells is more truthful. 

```{r}
X_healthy <- galen_aml_donors_healthy@assays$RNA@layers$data
X_healthy <- X_healthy[,!is.na(galen_aml_donors_healthy$backspin_celltype)]
rownames(X_healthy) <- rownames(galen_aml_donors_healthy)
# combine celltype prediction with gene expression
X_healthy_celltype <- rbind(X_healthy, t(y_pred_healthy))

X_malignant <- galen_aml_donors_malignant@assays$RNA@layers$data
rownames(X_malignant) <- rownames(galen_aml_donors_malignant)
# combine celltype prediction with gene expression
X_malignant_celltype <- rbind(X_malignant, t(y_pred_malignant))

X_celltype <- cbind(X_healthy_celltype, X_malignant_celltype)
dim(X_celltype)

X <- cbind(X_healthy, X_malignant)
dim(X)

y = factor(c(rep("healthy", ncol(X_healthy)), rep("malignant", ncol(X_malignant))))
table(y)
```

Keep cells expressed in >= 1% in at least one of the classes. 

```{r}
keep.genes.expressed <- rowSums((X > 0) / ncol(X)) >= 0.01
table(keep.genes.expressed)

# expressed in 1% cells in at least 1 class
classes <- unique(y)
keep.genes.class <- lapply(classes, FUN = function(class) {
  genes <- names(which(rowSums((X[,y == class] > 0) / sum(y == class)) >= 0.01))
  return(genes)
})
keep.genes.class <- rownames(X) %in% unlist(keep.genes.class)
names(keep.genes.class) <- rownames(X)
table(keep.genes.class)
```

```{r}
X <- X[names(which(keep.genes.class)),]
X_celltype <- X_celltype[c(names(which(keep.genes.class)), colnames(y_pred_malignant)), ]
X_celltype_driver <- X_celltype[!rownames(X_celltype) %in% genes_genotypes, ]
```

Train 1 classifier with celltype probabilities as features and one without and one without genes with driver mutations. 

```{r}
set.seed(123)
rf.malignant <- randomForest(
  x = t(X),
  y = y,
  sampsize = rep(200, length(unique(y))),
  ntree = 200,
  do.trace = F
)
```

```{r}
set.seed(123)
rf.malignant.celltype <- randomForest(
  x = t(X_celltype),
  y = y,
  sampsize = rep(200, length(unique(y))),
  ntree = 200,
  do.trace = F
)
```

```{r}
set.seed(123)
rf.malignant.celltype.driver <- randomForest(
  x = t(X_celltype_driver),
  y = y,
  sampsize = rep(200, length(unique(y))),
  ntree = 200,
  do.trace = F
)
```

Including cell type probabilities gives a 

```{r}
rf.malignant
rf.malignant.celltype
rf.malignant.celltype.driver
```

```{r}
y_malignant_celltype_pred <- predict(rf.malignant.celltype, newdata = t(X_celltype), type = "prob")
```

```{r}
y_malignant_celltype_pred_max <- colnames(y_malignant_celltype_pred)[max.col(y_malignant_celltype_pred)]

donor_list <- unique(galen_aml_donors_malignant$Sample)
names(donor_list) <- donor_list

# donor_accuracy_rf.malignant.celltype <- lapply(donor_list, function(donor) sum(y_malignant_celltype_pred_max[donors == donor] == "malignant") / sum(donors == donor))
# donor_accuracy_rf.malignant.celltype
```


### Model evaluation

#### Cross-validation on AML donors

Pick 4 donors at random (to save time).

Accuracy is how many cells are correctly classified as malignant. 

With this cross validation, we could do more paramter testing. Like the number of features to pick for the inner cell type classifier.
Or how many paramters to sample for each decision in the tree. Or the depth of the trees. 

```{r}
donors <- c(galen_aml_donors_healthy$Sample, galen_aml_donors_malignant$Sample)
table(donors)
donors_cv <- unique(galen_aml_donors_malignant$Sample, galen_aml_donors_healthy$Sample)
set.seed(123)
donors_cv <- sample(donors_cv, size = 4)
donors_cv
```

```{r}
cv_results <- lapply(donors_cv, function(donor) {
  cat(donor)
  cat("\n")
  
  # train and validation split
  X_celltype_cv_train <- X_celltype[, donors != donor]
  X_celltype_cv_test <- X_celltype[, donors == donor]
  y_cv_train <- y[donors != donor]
  y_cv_test <- y[donors == donor]
  
  # train on test data
  set.seed(123)
  rf.malignant.celltype.cv <- randomForest(
    x = t(X_celltype_cv_train),
    y = y_cv_train,
    sampsize = rep(200, length(unique(y_cv_train))),
    ntree = 200,
    do.trace = FALSE
  )
  
  # predict validation data
  y_cv_pred <-
    predict(
      rf.malignant.celltype.cv,
      newdata = t(X_celltype_cv_test),
      type = "prob"
    )
  y_cv_pred_max <- colnames(y_cv_pred)[max.col(y_cv_pred)]
  
  # calculate accuracy
  accuracy <- sum(y_cv_pred_max == y_cv_test) / length(y_cv_pred_max)
  return(accuracy)
})

names(cv_results) <- donors_cv
```

Cross validation when leaving out driver mutation genes. 

```{r}
cv_results_driver <- lapply(donors_cv, function(donor) {
  cat(donor)
  cat("\n")
  
  # train and validation split
  X_celltype_cv_train <- X_celltype_driver[, donors != donor]
  X_celltype_cv_test <- X_celltype_driver[, donors == donor]
  y_cv_train <- y[donors != donor]
  y_cv_test <- y[donors == donor]
  
  # train on test data
  set.seed(123)
  rf.malignant.celltype.cv <- randomForest(
    x = t(X_celltype_cv_train),
    y = y_cv_train,
    sampsize = rep(200, length(unique(y_cv_train))),
    ntree = 200,
    do.trace = FALSE
  )
  
  # predict validation data
  y_cv_pred <-
    predict(
      rf.malignant.celltype.cv,
      newdata = t(X_celltype_cv_test),
      type = "prob"
    )
  y_cv_pred_max <- colnames(y_cv_pred)[max.col(y_cv_pred)]
  
  # calculate accuracy
  accuracy <- sum(y_cv_pred_max == y_cv_test) / length(y_cv_pred_max)
  return(accuracy)
})

names(cv_results_driver) <- donors_cv
```

```{r}
donor_metadata %>%
  filter(Sample %in% donors_cv, `Days from diagnosis` == "D0") %>%
  select(Sample, `RHP Mutations`, `Common translocation`) %>%
  unique()
```

```{r}
lapply(cv_results, round, 3)
```

```{r}
lapply(cv_results_driver, round, 3)
```


Excluding genes with driver mutations increases patient CV accuracy slightly in 3/4 cases.



#### Cross-validation on driver mutations

TODO

To test the generalizability of the model on unseen mutations, cross validation should be performed with holding out all samples with a specific driver mutation. 

"To exclude the possibility that the high frequency of cells with detected NPM1 mutations affected the classifier, we generated a separate classifier that does not consider NPM1 mutant calls. This separate classifier had equally high specificity (99.8% of normal cells correctly called normal), and sensitivity (93% of malignant cells correctly called malignant) in 5-fold cross-validation. It is also consistent with the original classifier: 97% of cells originally classified as normal were classified as normal; 91% of cells originally classified as malignant were classified as malignant. These results indicate that the classifier is robust to the frequency of NPM1 mutations in the training set."

Should have put the NPM1 sample in the non-NPM1 trained classifier.

#### External test dataset

scRNA-seq from AML patients (NPM1) with sc genotyping.

Naldini, M.M., Casirati, G., Barcella, M. et al. Longitudinal single-cell profiling of chemotherapy response in acute myeloid leukemia. Nat Commun 14, 1285 (2023). [https://doi.org/10.1038/s41467-023-36969-0](https://www.nature.com/articles/s41467-023-36969-0)

10 patients with NPM1mut AML, present in van Galen dataset.
Majority of NPM1mut AML sampels in van Galen dataset also have DNMT3 mutation. DNMT3 status is not mentioned for this study. 

3 patients with del(7) AML, not present in van Galen data.

*Preprocessing information*
Feature-barcodes filtered matrices from Cell Ranger were used as input for Seurat R package64,65 (version 3.2.3). Seurat objects were merged in a single full dataset. Cells with a mitochondrial count ratio higher than 0.2 and <100 or >7000 expressed genes were removed from the dataset. UMI counts were log normalized and scaled for a factor of 10,000.  
The top 20% most variable genes were selected for downstream analysis. Cell cycle scores were assigned with the CellCycleScoring function using a reference gene lists66. We scaled data and regressed out unwanted variability by passing UMI count, percent of mitochondrial genes and cell cycle difference defined variables to the vars.to.regress argument. Cell cycle difference was defined as the difference between S phase and G/2 M phase module scores. Downstream analysis was performed on the top 100 principal components (PCA). In order to reduce patient-related and 10x chemistry version (v2 vs v3) bias, we performed data integration using the Harmony package (v1.0)24.

*Mutation annotation information*
NPM1-MF considers the UMI counts supporting either NPM1 mutant or WT allele to classify cells as 
MUT (≥1 UMI for the mutant allele)
WT (>5 WT transcripts, no mutant transcripts)
ND – not detected (cells with ≤ 5 WT transcripts and no mutant transcripts)
NoCall (cells without coverage over the NPM1 mutation region)

Check patient metadata.  
Do not provide information on age, Sex or clinincal blast count.

```{r results=F}
metadata_npm1_patients <- readxl::read_xlsx("naldini_et_al/41467_2023_36969_MOESM5_ESM.xlsx")
```


```{r}
metadata_npm1_patients
```


Check cell metadata.  
I can't figure out what the orig.ident (e.g. M03) is. It is the ID used in the expression matrix. So maybe the capture, but it doesn't make much sense to have 9 cells from PT12 and 570 cells from PT13 in M44. Why would there be leakage between captures. 

PT01, PT02, PT13 are primary refractory. 
PT06, PT07, PT12 are long-term complete remission.
PT08, PT09, PT10, PT15 are NPM1mut AML relapse post-chemotherapy.

```{r}
metadata_npm1_cells <- readRDS("naldini_et_al/GSE185991_Full_Patient_metadata.rds")
head(metadata_npm1_cells)
```

Check number of genotyped cells per sample to pick one for testing. 

PT02, combining diagnosis and D30 samples, has a lot of WT and MUT cells, NPM1mut, primary refractory. 
Good patient to test if classifier can extrapolate to other scRNA-seq datasets.

```{r}
metadata_npm1_patients %>%
  mutate(Sample = paste(PatientID, Timepoint, sep = "_")) %>%
  group_by(Sample, Classification) %>%
  summarize(n_cells = sum(n)) %>%
  ggplot(aes(x = Sample, y = n_cells, fill = Classification)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))
```

Captures with PT02 data.
Not sure why there are only 9 cells in the other capture. 

```{r}
metadata_npm1_cells %>%
  filter(PatientID == "PT02") %>%
  pull(orig.ident) %>%
  table()

metadata_npm1_cells %>%
  filter(PatientID == "PT07") %>%
  pull(orig.ident) %>%
  table()
```

Read in count data.

```{r}
naldini_seurat_m25 <- Read10X(
  data.dir = "naldini_et_al/GSE185991_RAW/M25/",
  gene.column = 2,
  cell.column = 1,
  unique.features = TRUE,
  strip.suffix = FALSE
)
colnames(naldini_seurat_m25) <- paste0("M25_", gsub("-1", "", colnames(naldini_seurat_m25)))

naldini_seurat_m26 <- Read10X(
  data.dir = "naldini_et_al/GSE185991_RAW/M26/",
  gene.column = 2,
  cell.column = 1,
  unique.features = TRUE,
  strip.suffix = FALSE
)
colnames(naldini_seurat_m26) <- paste0("M26_", gsub("-1", "", colnames(naldini_seurat_m26)))

# combine captures and merge
naldini_seurat_pt02 <- cbind(naldini_seurat_m25, naldini_seurat_m26)
naldini_seurat_pt02 <- CreateSeuratObject(counts = naldini_seurat_pt02, min.cells = 0, min.features = 200)
ncol(naldini_seurat_pt02)

naldini_seurat_m25 <- NULL
naldini_seurat_m26 <- NULL
gc()

# subset to cells present in metadata
naldini_seurat_pt02 <- subset(naldini_seurat_pt02, cells = rownames(metadata_npm1_cells))
ncol(naldini_seurat_pt02)

# add metadata to Seurat object
naldini_seurat_pt02 <- AddMetaData(naldini_seurat_pt02, metadata_npm1_cells)
table(naldini_seurat_pt02$orig.ident)
```

```{r}
naldini_seurat_m21 <- Read10X(
  data.dir = "naldini_et_al/GSE185991_RAW/M21/",
  gene.column = 2,
  cell.column = 1,
  unique.features = TRUE,
  strip.suffix = FALSE
)
colnames(naldini_seurat_m21) <- paste0("M21_", gsub("-1", "", colnames(naldini_seurat_m21)))

naldini_seurat_m22 <- Read10X(
  data.dir = "naldini_et_al/GSE185991_RAW/M22/",
  gene.column = 2,
  cell.column = 1,
  unique.features = TRUE,
  strip.suffix = FALSE
)
colnames(naldini_seurat_m22) <- paste0("M22_", gsub("-1", "", colnames(naldini_seurat_m22)))

naldini_seurat_m23 <- Read10X(
  data.dir = "naldini_et_al/GSE185991_RAW/M23/",
  gene.column = 2,
  cell.column = 1,
  unique.features = TRUE,
  strip.suffix = FALSE
)
colnames(naldini_seurat_m23) <- paste0("M23_", gsub("-1", "", colnames(naldini_seurat_m23)))

naldini_seurat_m24 <- Read10X(
  data.dir = "naldini_et_al/GSE185991_RAW/M24/",
  gene.column = 2,
  cell.column = 1,
  unique.features = TRUE,
  strip.suffix = FALSE
)
colnames(naldini_seurat_m24) <- paste0("M24_", gsub("-1", "", colnames(naldini_seurat_m24)))

# combine captures and merge
naldini_seurat_pt07 <- cbind(naldini_seurat_m21, naldini_seurat_m22, naldini_seurat_m23, naldini_seurat_m24)
naldini_seurat_pt07 <- CreateSeuratObject(counts = naldini_seurat_pt07, min.cells = 0, min.features = 200)
ncol(naldini_seurat_pt07)

naldini_seurat_m21 <- NULL
naldini_seurat_m22 <- NULL
naldini_seurat_m23 <- NULL
naldini_seurat_m24 <- NULL
gc()

# subset to cells present in metadata
naldini_seurat_pt07 <- subset(naldini_seurat_pt07, cells = rownames(metadata_npm1_cells))
ncol(naldini_seurat_pt07)

# add metadata to Seurat object
naldini_seurat_pt07 <- AddMetaData(naldini_seurat_pt07, metadata_npm1_cells)
table(naldini_seurat_pt07$orig.ident)
```



Filter cells for >3000 UMI, 1000 genes, <10% mitochondrial transcripts PT02.
PT07 seems to have lower quality. Use less stringent filters. 

```{r}
VlnPlot(naldini_seurat_pt02, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"), log = T, pt.size = 0)
VlnPlot(naldini_seurat_pt07, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"), log = T, pt.size = 0)
```

Seems like PT02 is female, PT07 is male. 

```{r}
VlnPlot(naldini_seurat_pt02, features = c("XIST"))
VlnPlot(naldini_seurat_pt07, features = c("XIST"))
```


```{r}
naldini_seurat_pt02 <-
  subset(naldini_seurat_pt02,
         subset = nCount_RNA > 3000 & nFeature_RNA > 1000 & percent.mt < 10)
naldini_seurat_pt07 <-
  subset(naldini_seurat_pt07,
         subset = nCount_RNA > 1000 & nFeature_RNA > 500 & percent.mt < 10)
```


```{r}
table(naldini_seurat_pt02$orig.ident)
table(naldini_seurat_pt07$orig.ident)
```

```{r}
naldini_seurat_pt02_pt07 <-
  merge(naldini_seurat_pt02, naldini_seurat_pt07)
naldini_seurat_pt02_pt07

naldini_seurat_pt02_pt07 <-
  merge(
    naldini_seurat_pt02,
    y = naldini_seurat_pt07,
    add.cell.ids = c("PT02", "PT07"),
    project = "Naldini"
  )
naldini_seurat_pt02_pt07
```

Normalize gene expression. 

```{r}
naldini_seurat_pt02_pt07 <- NormalizeData(naldini_seurat_pt02_pt07)
```


##### Predict HSPC cell type

```{r}
X <-
  cbind(
    naldini_seurat_pt02_pt07@assays$RNA@layers$data.1,
    naldini_seurat_pt02_pt07@assays$RNA@layers$data.2
  )
rownames(X) <- rownames(naldini_seurat_pt02_pt07)
```

Some features used in classifier are missing in data, either due to gene ID mismatches or because the genes are not expressed. 
Both van Galen and Naldini were aligned to hg38, so it is more likely lack of expression. 
Will set missing genes to 0. 

```{r}
features_missing <-
  rownames(rf.celltype.inner$importance)[!rownames(rf.celltype.inner$importance) %in% rownames(X)]

features_missing_mat <-
  matrix(nrow = length(features_missing),
         ncol = ncol(X),
         data = 0)
rownames(features_missing_mat) <- features_missing
colnames(features_missing_mat) <- colnames(X)

X <- rbind(X, features_missing_mat)

X <- X[rownames(rf.celltype.inner$importance), ]
```


```{r}
naldini_seurat_pt02_pt07_celltype_pred <- predict(rf.celltype.inner, newdata = t(X), type = "prob")
```

##### Predict malignant cells

```{r}
X <-
  cbind(
    naldini_seurat_pt02_pt07@assays$RNA@layers$data.1,
    naldini_seurat_pt02_pt07@assays$RNA@layers$data.2
  )
X <- rbind(X, t(naldini_seurat_pt02_pt07_celltype_pred))
rownames(X) <-
  c(
    rownames(naldini_seurat_pt02_pt07),
    colnames(naldini_seurat_pt02_pt07_celltype_pred)
  )

features_missing <-
  rownames(rf.malignant.celltype$importance)[!rownames(rf.malignant.celltype$importance) %in% rownames(X)]
features_missing_mat <-
  matrix(nrow = length(features_missing),
         ncol = ncol(X),
         data = 0)
rownames(features_missing_mat) <- features_missing
colnames(features_missing_mat) <- colnames(X)

X <- rbind(X, features_missing_mat)
```


```{r}
naldini_seurat_pt02_pt07_malignant_pred <-
  predict(rf.malignant.celltype,
          newdata = t(X),
          type = "prob")

naldini_seurat_pt02_pt07_malignant_pred_max <-
  colnames(naldini_seurat_pt02_pt07_malignant_pred)[max.col(naldini_seurat_pt02_pt07_malignant_pred)]
```


##### Results

Compare malignant vs. healthy prediction with genotype data. 

```{r}
naldini_seurat_pt02_malignant_pred_max <- naldini_seurat_pt02_pt07_malignant_pred_max[naldini_seurat_pt02_pt07$PatientID == "PT02"]
table(
  naldini_seurat_pt02_malignant_pred_max,
  naldini_seurat_pt02@meta.data$Classification
)

round(t(t(
  table(
    naldini_seurat_pt02_malignant_pred_max,
    naldini_seurat_pt02@meta.data$Classification
  )
) / as.vector(
  table(naldini_seurat_pt02@meta.data$Classification)
)), 3)
```

```{r}
naldini_seurat_pt07_malignant_pred_max <-
  naldini_seurat_pt02_pt07_malignant_pred_max[naldini_seurat_pt02_pt07$PatientID == "PT07"]
table(
  naldini_seurat_pt07_malignant_pred_max,
  naldini_seurat_pt07@meta.data$Classification
)

round(t(t(
  table(
    naldini_seurat_pt07_malignant_pred_max,
    naldini_seurat_pt07@meta.data$Classification
  )
) / as.vector(
  table(naldini_seurat_pt07@meta.data$Classification)
)), 3)
```


Test on sample with a mutation not present in van Galen data. 

del(7) AML: deletion in chromosome 7. 

Classification of del(7) cells into malignant/wt:  
"We leveraged the AddModuleScore function for evaluating the expression level of a Chr 7 signature by using as input gene list all genes located on it. The observed distribution of Chr 7 module scores in the datasets followed a bimodal distribution allowing us to classify cells as AML or non-AML by running a k-means clustering (n = 2) on the vector of Chr 7 signature scores and labelling cells in the high score group as non-AML and those in the low score group as AML."

Unfortunately, not included in metadata.

PT11, PT17: refractory disease; PT18: early relapse

```{r}
metadata_del7_cells <- readRDS("naldini_et_al/GSE185991_AML_Del7_metadata.rds")

table(metadata_del7_cells$orig.ident, metadata_del7_cells$PatientID)
```

Computing module score with all genes on chr7 might be very slow. 

Download cellranger gtf reference, get all genes on chr7.

```{sh eval=FALSE}
source="/Users/holzehenrietta/Documents/personal/phd_applications/muds_task_marr/reference_sources"
mkdir -p "$source"
gtf_url="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz"
gtf_in="${source}/gencode.v44.primary_assembly.annotation.gtf"

if [ ! -f "$gtf_in" ]; then
    curl -sS "$gtf_url" | zcat > "$gtf_in"
fi

grep "^chr7" $gtf_in | awk '$3 == "gene"' | sed 's/.*gene_name "\([^"]*\)".*/\1/g' > ${source}/gencode.v44_gene_names_chr7.txt

grep "^chrY" $gtf_in | awk '$3 == "gene"' | sed 's/.*gene_name "\([^"]*\)".*/\1/g' > ${source}/gencode.v44_gene_names_chrY.txt
```

```{r}
chr7_genes <-
  read_lines(
    "/Users/holzehenrietta/Documents/personal/phd_applications/muds_task_marr/reference_sources/gencode.v44_gene_names_chr7.txt"
  )
```

Subset to genes expressed in >= 10% of cells in del(7) scRNA-seq data to speed up analysis.

```{r}

```


## Save Session Info

```{r}
writeLines(capture.output(sessionInfo()), paste0(Sys.Date(), "_sessionInfo.txt"))
```

